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Resume 



Ces dernieres annees ont vu le developpement dc techniques experimentales permettant I'analyse 
quantitative de systemes biologiques, dans des domaines qui vont de la neurobiologie a la bi- 
ologie moleculaire. Notre travail a pour but la description quantitative de tels systemes a 
travers dcs oTitils theoriques et numeriques issus de la physique statistique et du calcul des 

probabilitcs. 

Cette these s'articule en trois volets, ayant chacun pour but I'etude d'un systemc biophysique. 
Premierement, on se concentre sur I'infotaxie, un algorithme de recherche olfactive base sur 
une approche de theorie de Tinformation propose par Vergassola et collaborateurs en 2007: 
on en donne une formulation continue et on en caracterise les performances. 
Dans une deuxieme partie on etudie les experiences de micromanipulation a molecule unique, 
notamment celles de degrafFage mecanique de I'ADN, dont les traces experimentales sont 
sensibles a la sequence de I'ADN: on developpe un modclc detaille de la dynamique de ce type 
d'experience et ensuite on propose plusieurs algorithmes d'inference ayant pour objectif de 
caracteriser la sequence genetique. 

Finalement, on donne une description d'un algorithme qui permet I'inference des interactions 
entre neurones a partir d'enregistrements a electrodes multiples et on propose un logiciel 
integre qui permettra a la communaute des biologistes d'interpreter ces experiences a partir 
de cet algorithme. 
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Abstract 



During the past few years the development of experimental techniques has allowed the quan- 
titative analysis of biological systems ranging from neurobiology and molecular biology. This 
work focuses on the quantitative description of these systems by means of theoretical and 
numerical tools ranging from statistical physics to probability theory. 

This dissertation is divided in three parts, each of which has a different biological system as 
its focus. 

The first such system is Infotaxis, an olfactory search algorithm proposed by Vergassola et al. 
in 2007: we give a continuous formulation and we characterize its performances. 
Secondly we will focus on single- molecule experiments, especially unzipping of DNA molecules, 
whose experimental traces depend strongly on the DNA sequence: we develop a detailed model 
of the dynamics for this kind of experiments and then we propose several inference algorithm 
aiming at the characterization of the genetic sequence. 

The last section is devoted to the description of an algorithm that allows the inference of 
interactions between neurons given the recording of neural activity from multi-electrode ex- 
periments; we propose an integrated software that will allow the analysis of these data. 
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Introduction 



Probabilistic models 

Many systems encountered in quantitative biology are best described by probabilistic mod- 
els. There are essentially three reasons why a probabilistic model would be preferred: either 
the process is thermally activated, either experimental conditions cannot be controlled in full 
detail or there are many possible realizations of annealed disorder in some of the involved 
variables. 

Systems where the dynamics are thermally activated arc widespread at the macromolecular 
scale (sizes ranging 1 — 100 nm), because of this the dynamics of most systems from molecular 
biology will exhibit stochastic behavior. In this dissertation we will touch such systems in 
Part II while addressing DNA unzipping experiments. 

Many biological experiments are performed in conditions where several variables cannot be 
controlled in detail: organisms which are genetically identical will exhibit different pheno- 
types, conditions of the medium will vary. In Part I we will observe turbulence can have such 

an effect in the description of olfactory searches. 

Thirdly, many biological systems exhibit a characteristic which is similar to that of annealed 
disorder in condensed matter physics, that is, there are a number of variables which can be 
treated as random because they are drawn from an ensemble of possible realizations but do 
not change during experiments. Examples include DNA and RNA (where the variable is 
the genetic sequence), proteins (amino-acid content) and neural systems (interaction matrix). 
Such systems will be addressed in Parts II and III. 

A probabilistic model will assign a proability to the outcome of an experiment. As it is pos- 
sible to do this, the inverse problem can be of interest, that is we can assign a probability to 
a model or a set of parameters given the outcome of an experiment. This type of question is 
at the core of our thesis and of Bayesian inference. 
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Bayes' theorem 

Bayes' theorem was derived by Thomas Bayes and was only pubUshed posthumously in 1763 
[Bayes 63 , Bayes 58j . It is now regarded as one of the founding pillars of probability theory. 



By today's standards the name theorem is probably a misnomer since its derivation is a 
straightforward manipulation of the the definition of conditional probability: 

p(Am = (1) 

where P{A n B) is the probability of event A and B both happening. 

If we now switch A and B and redefine combine the two definitions we obtain the classical 
expression of Bayes' theorem: 

where P{A) is usually called the prior, P{B\A) likelihood function and P{A\B) posterior. 
The importance of this theorem in performing statistical inference can only be understated in 
fact, if one interprets A as the parameters of a model and B as the outcome of an experiment 
we can see how this theorem relates the predictive power of a model to the inference of the 
best model or set of parameters. By rewriting the model this way: 

, , , , X P(data|modeli)P(modeli) 

P(modeli data) = ^ ' , ^ (3) 

X]iP(data|modeli)P(modeli) ^ ^ 

Let us give an example to further clarify this statement. Let us suppose we have two coins: 
one fair and one which is biased with probability p of heads turning up. 
While it is straightforward to compute the outcome of an experiment knowing which coin we 
are handling: say two consecutive heads yield P(HH|fair = 1/4, we wish to know P(fair|HH). 
Thanks to Bayes' theorem this can be done in a straightforward manner: 

^(fe-|HH) = (4) 

The attentive reader will have noticed we have placed ourselves in a very specific situation: 
we know we only have two coins, and we know the bias of one of them. 

The problem of testing the hypotesis of whether a coin is biased or not in the most general 
conditions is a much more complicated one and is illuminating as to the limitations of Bayesian 
inference. 

Our toy example had the very compelling feature of defining naturally the prior distribution, 
that is P(modeli) was 1/2 for i = 1,2: both coins were equiprobable. How do we define priors 
for more general cases? 

Sometimes some general choices are available, for example one could the maximum entropy 
probability distribution with given characteristics such has a given support or a given expected 
value. However this is not always possible especially when the support of the distribution is 
unbounded. 

However if we consider successive experiments and we refine the posterior every time we expect 
the choice of prior to be unimportant asymptotically. 
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P(tiur|HH) 
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Figure 1: P(fair|HH) as a function of p. Note how the probabiUty is maximum when p 
vanishes and it's minimum and equal to 1/5 when the unfair coin always returns heads, that 
is when p = 1. 



Bayesian inference 



Bayesian inference is the iterative application of Bayes' theorem to update one's knowledge 
about a random variable which might be a parameter of our model. It is not the only form of 
statistical inference, but it has several characteristics which make it more desirable than other 
techniques such as frequentist inference, where the frequency is interpreted as a probability. 
First of all Bayesian inference will return a probability distribution, which in general contains 
a lot more information than an inferred value and a confidence interval. 
On the other hand, as we have said before, Bayesian inference can depend strongly on the 
choice of a prior distribution of which there might not always be a natural choice. 
Let us give an example where a Bayesian approach is much superior: a hunter is hunting 
with his dog, we can observe the position of the dog but we cannot observe the position of 
the hunter, we further know the that the dog to be located with a certain probability p in a 
radius r around the hunter. 

The frequentist approach would lead to the following reasonment: since I have observed the 
dog in a given position: the hunter is in a radius r around this position with probability p. 
However relies on several tacit assumptions: the isotropy of the distribution of the dog around 
the hunter, different directions need not be equiprobable, in fact the dog will prefer to be up- 
wind from the hunter; secondly the uniformity of the distribution of positions of the hunter 
regardless of where the dog is. 

To put it in a mathematical form the frequentist approach equates P{D\H) to P{H\D) ig- 
noring P{H), the prior or the distribution of the position of the hunter and ignoring that 
P{D\H) might depend on more than just the distance between the dog and the hunter. 
Another classical application of Bayesian inference is the computation of the number of false 
positive in a medical test: Let us suppose there is a very rare disease which occures only in a 
tiny fraction e of the population. A test for this disease returns a false result with probability 
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p. 



P(negative|sick) = P(positive| healthy) 
P(positive|sick) = P(negative| healthy) 

P(sick) 



p 



e 



1 — p 



Bayes theorem tells us that: 



P(false negative) = P(sick|negative) 



pe 



P (false positive) = P (healthy [positive) 



pe+(l-p)(l-e) 
p{l-e) 



p{l-e) + {l-p)e' 



As you can see these probabilities look much different even if the accuracy of the test is the 
same for false positives and false negatives. What is happening? The rarity of the disease 
determines a very high rate of false positives, in fact it can be shown that more than half of 
the positives are false unless the probability p of having an inaccurate result is smaller than 
the prevalence of the disease e. 

Bayesian inference in quantitative biology 

Bayesian inference has an increasingly important role in quantitative biology: the emergence 
of large data sets coming from molecular biology, neurosciences and molecular biology has 
increased the need for sophisticated mathematical techniques for their analysis. 
Examples of biological systems are being successfully investigated through the use of Bayesian 
inference range from phylogenetics [Huelsenbeck Olj , where one wants to reconstruct the most 
likely evolutionary tree from genetic data to gene regulatory networks where a stochastic ap- 
proach has been recently shown to be very successful [Elowitz 02[ IZou 05j . 
Moreover moving away from the molecular scale systems such as neural networks and bacte- 
rial motility have greatly benefited by such approaches. 

In what follows we will concentrate on two main problems and give a brief outline of a third. 
The first problem we tackled is that of spatial searches with dilute and stochastic information 
about the location of an object. More precisely we will turn to a strategy originally devised 
by Vergassola et al. [Vergassola 07b| that makes use of an informational theoretical approach 
for the location of an odor emitting source. 

During our thesis we have developed a continuous version of the algorithm and an extensive 
analysis of its performances and trajectories. 

The second problem we will turn to regards unzipping experiments of DNA molecules: the 
force-extension signal that can be measured in these experiments is strongly dependent on 
the DNA sequence. 

At first we will describe the direct problem of reproducing experimental traces on a computer 
and we will describe a software package we have developed with F. Zamponi, R. Monasson 
and S. Cocco during our thesis, that can simulate the dynamics of such an experiment in a 
highly modular way. 

Then we will propose several strategies for the inverse problem of reconstructing the sequence 
from the unzipping traces. 

Lastly we have devoted a section (appendix [A]) to a brief technical description of an algorithm 
for the inference of the interaction matrix of integrate and fire neurons. This algorithm has 
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been developed by Monasson and Cocco and our effort during our thesis has been a trans- 
lation of the code to the C language, the development of an interface with Matlab and code 
optimization. 
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Part I 

Infotaxis 
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Chapter 1 

Introduction 



1.1 Taxes and the biology of searching 

A taxis is the innate directional response of the motihty of an organism to a stimulus. On 
the other hand responses that imply a change in orientation or in the direction of growth are 
called tropisms and those which are not directional are called kineses. 

The term taxis is most commonly found speaking of unicellular organisms, because of its 
automatic and innate nature, even thought it is sometimes applied to insects and crustaceans. 
Stereotyped responses in higher organisms are commonly thought to be less reflex-like, they 
are usually categorized as instincts and are the subject of study of ethology. 
Taxes can be distinguished according to the nature of the sensory organs implied: 

Klinotaxis Different successive stimuli are measured by a single sensory organ. 

TropotEixis Well spaced sensory organs measure stimuli on different parts of the organism. 

Telotaxis The perception is mediated by a single directional organ. When the motor re- 
sponse is at an angle to the direction of the source some sources distinguish menotaxis. 

Taxes can also be divided according to the type of stimulus they respond to: chemotaxis 
(chemical gradients), phototaxis (light sources), geotaxis (gravitational fields), magnetotaxis 
(magnetic fields) and so on and so forth. 



1.2 Chemotaxis 

The type of taxis which has attracted the most interest in biology is probably chemotaxis, 
because of its ubiquity in unicellular organisms as inside multicellular organisms. 
The first observation of bacterial motility date back to the beginnings of microscopy, but we 
have to wait for the end of the nineteenth century for the first observations of responses to 
chemical gradients. 

It is important to distinguish, as we will do in the following, between bacterial and eukaryotic 
chemotaxis. 

Bacteria are very small cells, whose size is of the order of the micrometer, below that of typical 
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fluctuations of chemical fields: this forbids them to be directly sensitive to chemical gradients. 
Because of this chemosensation must happen through successive intensity assays. According 
to the preceding section definitions it is a klinotaxis. 

Eukaryotic cells can be much bigger than bacteria: some species can reach sizes of the order 
of a millimeter and typical sizes range in the tens and hundreds of micrometers. Because of 
this in eukaryotes chemosensation happens through the instantaneous differentiation of stim- 
uli coming from different parts of the organisms. In this case chemotaxis can be defined as a 
tropotaxis. 

In the light of this distinction and of the differences between motor organs in different organ- 
isms, bacterial and eukaryotic chemotaxis must be considered as different phenomena. 



1.2.1 Chemotaxis in bacteria 

Many reviews of bacterial chemotaxis exist in literature, for example the classic Adler's 
[Adler 66] or Berg's fBerg 88 , which has an extensive bibliography. Here we will follow 



another Berg's review [B erg 75] which is more focused on theory than on bacterial physiology. 



Microbiology's workhorse is certainly Escherichia coli (pictured in figure 1.1), partly for his- 



torical reasons, because of it's ubiquity in human guts and certainly for its simplicity. 
E. coli is endowed with about six flagella positioned on its surface. When those turn anti- 
clockwise they form a bundle and push the bacterium in a definite direction. Flagella can 
turn clockwise too: when this happens the bundle opens up and the bacteria tumbles on itself 
in a random fashion. 

Those two modes of movement are the fundamental components o chemotactic response in 




Figure 1.1: A specimen of Escherichia coli. Notice the flagella that enable it to move, now 
unbundled. 



flagellates and are called swims in the first case and tumbles in the second. 

Swims length is temporally limited by Brownian noise which, at room temperature for a body 

of size of a micrometer, decorrelates the heading of the bacteria in about ten seconds. Because 

of this reason bacteria tumble before losing their original heading completely. 

Tumbles on the other hand are a random event which last about a tenth of a second. The 

new heading of after a tumble is completely independent of the one before. 

Up to here the description of the motion of a flagellate does not differ significantly from a 
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random walk; in the absence of chemical gradients the duration of swims is distributed as an 
exponentially random variable (that is to say that tumbles are a Poisson process). 
Directional response in the motion of E. coli happens through the variation of the average 
duration of swims: if the bacteria is moving in a favorable direction swims become longer. 
This observation is compatible with what we have said about the klinotactic nature of bac- 
terial chemotaxis. Because of diffusive reasons, bacteria are not capable of discriminating 
between favorable and unfavorable directions during a tumble, but it is forced to sample the 
gradient during the swim. In other words the chemical gradient signal to noise ratio is big 
enough only on distances of the order of swims, not on the scale of the size of bacteria. 
E. coli temporal response to gradients has been studied thanks to the response to short im- 
pulses. Bacteria effectuate time differentiation through an integral of concentration at different 
times multiplied to a function which has a positive weight for the first second immediately in 
the past and a negative weight for the three preceding seconds: 



P(tumble) = l-k dtc{t)w{t) , 



(l.I) 



where k and I are positive real constants that ensure normalization and w{t) is a compact 
support weight function which has the characteristics we have just described and which were 



measured by Segall et al. in [Segall 86] (see Figure 1.2). This can be rewritten integrating by 
parts as: 



P(tumble) = l + k dt c'{t)W{t) , 



(1.2) 



where is a compact support probability distribution which is zero outside the integration 
domain and W'{t) = w{t). 



The real world w has been measured by [Segall 86] and is shown in figure 1.2 the two lobes 
have equal area, which is consistent with our definition of W . The fact that the derivative 
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Figure 1.2: The response of bacteria to a chemoattractant in wild type E. coli. The dotted 
curve is the bias in the rate of tumbles after some attractant was pulsed at the vertical bar. 



From Segall 86 
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is averaged over a finite period of time is a desirable property, in fact it allows bacteria to 
average out fluctuations in concentration fields. On the other hand run lengths never get 
longer than a few seconds, because bacteria aren't able to go in a straight line for long periods 
of time because of rotational diffusion. 



1.2.2 Chemotaxis in eukaryotes 

As we have previously mentioned, eukaryotes sense chemical gradients in a way which is much 
different from bacteria. This difference has an effect on typical trajectories of a chemotactic 
eukaryote which, being able to sens gradients instantaneously and being much less affected 
by Brownian effects, is able to climb the chemoattractant gradient directly. 
Motility in eukaryotic cells happens through ameboid movement (as in slime molds), cilia (as 
in Tetrahymena, or through the eukaryotic flagellum (as in Chlamydomonas) , all these means 
of transportation are much more precise than the bacterial flagellum. 

Eukaryotic chemotaxis is not confined to unicellular organisms: it plays a central role in em- 
bryogenesis, in the immune system and also the spread of metastases. 

As is the case with many biological phenomena eukaryotic chemotaxis has its model organism: 



Dictyostelium discoideum (pictured in figure 1.3), a soil living amoeba which cycles through 

an unicellular and a multicellular state according to the environmental conditions. 

When D. discoideum undergoes starvation, it starts secreting cyclic AMP which is a chemoat- 




Figure 1.3: A few specimens of Dictyostelium discoideum. 



tractant, this way cells move towards one another until they stick to each other. When the 
cells are lumped together they form what is referred to as a pseudoplasmodium, or more 
colloquially a slug which measures a few millimeters. Some other slime molds can form pseu- 
doplasmodia of sizes of square meters which are commonly found on forest floors. 
D. discoideum we observed for the first time in 1933 [Raper 35 1, in the following years its life 
cycle was described in detail [Raper 40| and in the fifties cyclic AMP was identified as playing 
a central role in aggregation [Shaffer 53] . Nevertheless it wasn't until the beginning of the 
seventies that a model for aggregation was proposed [Keller 70] . and despite some resistance 
in the microbiology community later accepted. 

What was novel about this model was that aggregation was described as a truly collective 
phenomenon, like those found in the statistical physics of phase transitions. 
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1.3 Discrete infotaxis 
1.3.1 Historical models 

The description we have given for chemotactic cehs rehes heavily on the size of cells and on 
the nature of chemical gradients at their scale. If one wishes to model olfactory search, one 
has to deal with turbulence, intermittent signals and dilution of fields. 

First of all most chemoattractants degrade over times and scales which are relevant over the 
size of a typical search, we will see that this leads to exponentially decaying concentrations 
and that this has to be taken into account. 
Moreover the nature of olfactory system is such that it is impossible to instantaneously 




Figure 1.4: Left: a specimen of tobacco budworm {Heliothis virescens), a species of moth. 
Right: a few recorded trajectories of H. virescens from [Vickers 94] . 

perceive the spatial derivatives as in a tropotaxis: nostrils are usually very close and even if 
they were to be as far apart as ears or eyes the spatial information they would get would not 
be reliable. This is because of the effect of turbulence, local concentrations do not necessarily 
reflect the distance or direction of the source. 

In the past there have been a few attempts to define search strategies when information is 
scarce: one classic reference is Gal's book on search games |Gal 80] , but the amount of infor- 
mation in classical search games is simply too scarce for our purposes: there is no equivalent 
of the odor field, that is the source is found when the searcher is close enough and the searcher 
has no clue whether the source is close by or not unless it has been found. 

One further development of search strategies was given by Balkovsky and Shraiman [Balkovsky 02| 
who proposed a model for olfactory searches where both the searcher and the odor particles 
are bounded to move on the sites of a bidimensional discrete lattice. The model supposes an 
average wind direction, that we can take without loss of generality to be up to down. Odor 
particles then are made to move down at every time-step and can either move left, right or 
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not move at all on the horizontal axis with equal probabilities. Odor particles don't decay 
as in more refined models, thus the odor field is never dilute when the searcher is downwind 
with respect to the source and close to the wind axis. 

The authors observed that the stationary probability of finding an odor particle in x,y when 
the wind blows in y direction and one particle per time step is emitted is given by: 

1 x2 

P{x,y) = ^^^ e ^ (1.3) 

where D = (pj. + p\)/2 = 1/3 and pr = p\ = 1/3 are the probabilities of moving left and right. 
That is, being the variance proportional to y, most odor particles will be confined to the area 
< (AnDy). 

If an encounter has just been made and the searcher has no prior information on the position 
of the source, it follows from the Bayes' theorem that the source is most probably located in 
the area defined by a parabola having for vertex the position of the odor encounter. Prom 
this observation stems the strategy devised by the authors: once an odor particle has been 
encountered the searcher explores exhaustively zigzagging the area where the source is most 
probably located until either the source is found on another particle encountered. For a clearer 
pictures of what a typical trajectory looks like see figure [L5| 
The main drawback of this strategy is that it is guaranteed to work only in the case of non- 

SOURCE 



AVERAGE 




Figure 1.5: A sample trajectory of the algorithm proposed by Balkovsky and Shraiman. The 
continuous line is the trajectory, the dashed line the parabola that is the boundary to the 
area where the probability of encountering odor particles is significantly different from zero 
and the circles are the odor hits. From [Balkovsky 02| 



decaying odor particles, that is when the odor concentration does not decrease exponentially 
with the distance. 
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1.3.2 Definition of the odor detection model 

Recently Vergassola et al. have proposed an algorithm for olfactory searches: here we will 
describe what is the odor model that underlies their search strategy using the formalism used 
in the Supplementary informations of their paper [Vergassola 07b . 



The stationary concentration of odor particles c(y) in the absence of an average wind is given 
by: 

DVc{y)--c{y) + R5{y-y*) = Q, (1.4) 
r 

where D is the diffusion coefficient, that stems from molecular and turbulent diffusion, r is 
the mean decay time,i? is the rate of emission of odor particles and y* is the position of the 
source. 

This equation has analytic solutions and in two dimensions yields: 

-fe) - (^) ■ (-) 

where Kq is the zero-order modified Bessel function of the second kind, A is a characteristic 
length given by A = \fT)T and can be interpreted as the mean length traveled by an odor 
particle before decaying. It will be used in the following as the natural unit of lengths. 
In three dimensions the solution is: 

_ \v-v*\ 

C3(y) = /^^^. (1.6) 

^T\U \y — y \ 

The rate of encounter of odor particles per unit for a spherical searcher of radius a is given 
by relation due to Smoluchowski [Smoluchowski 17] : 

_ \y-v* I 

i?3(y) = ^i^Dac-M = r"-^ . (1.7) 

A \y-y*\ 

While in two dimensions the relation is: 

My) = ^^c,{y) = R "^^^^ ^ (1.8) 
where R is the number of emitted particles per second. 

These two equations define the natural unit of time that we will use throughout this work: in 
three dimensions the unit of time is while in two dimensions it is log (^) /R. Once the 
unit of time and length are defined through the actual physical constants of the system we 
need not worry about those details anymore: the description we will give of the system will 
be completely independent of them. 

Once this relation is known, the idea is to model the erratic nature of odor detection in a 
turbulent fiow as a Poisson process with a rate proportional to this rate of detection. This 
way odor is perceived through discrete hits which vary in frequency as we move closer to the 
source. Hits contain no information pertaining the direction of the source and are all equal in 
intensity. The probability of getting n hits during time Ai while standing still at coordinates 
y is: 

Py{n) = (^^e-^*^(^) . (1.9) 
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Figure 1.6: The rate of encounter of odor particles, in two (left) and three (right) dimensions. 
The divergence in the origin is much more abrupt in the three-dimensional case, than in the 
two-dimensional one, but the asymptotic behavior for large arguments is the same. 



This equation allows us to write the probability of receiving a mimber of hits n along a 
trajectory at times U given the knowledge of the position of the source, that is: 



P{x{t),ti\y) = e^p(^- dt'R{\y-x{t')\)^l[R{\y-x{ti)\), (1.10) 

where we have supposed no two hits happen at the same time. While this is reasonable for a 
continuous time description, in a discrete time framework one has to divide by n! whenever 
n hits happen during the same time-step, but we will see later this is of no importance. 

1.3.3 The Bayesian posterior 

Using Bayes' theorem we can write the probability of the source being at position y given the 
trajectory and the hits' times: 

exp (- difRi\y - xit')\)) Uli Rilv - <U)\) 

Ptiy\x{t),U) = Po{y)- , (1.11) 

/ dy' exp (- Jl dt'R{\y' - x{t')\)) U"=i RiW - xM) 

where Pq is the prior distribution for the position of the source, we will see later how this plays 
a central role. On the other hand the attentive reader will have noticed how the previously 
mentioned n! is cancelled out in this expression. 

This expression has a few interesting features: the exponential term accounts for the vanishing 
probability of finding the source along the trajectory, that is: if the source was along the 
trajectory it would be found; it is also responsible for the low probability of points close to 
the trajectory. On the other hand the terms in the product are diverging and concentrate the 
probability around the points where most hits have occurred. 

1.3.4 The expected value of the variation of entropy 

The main idea behind Vergassola et al. algorithm is to exploit the Bayesian posterior as 

defined in the previous section to define the best movement at the next step. 

This is done by defining the entropy of the posterior at a given time and by choosing the 
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direction that maximizes its decrease: that is the direction where we expect to gain the most 
information on the source. 

We will now compute this quantity in order to analyze the different contributions that make 
it up. 

Even if the description given up to now is completely independent of the nature of the space 
where the searcher moves, be it a discrete lattice or an Euclidean space, and whether the time 
is discretized or continuous, we will from now on follow the description of the discrete version 
of the algorithm given by Vergassola et al. . 

Let Pt{y) be the posterior probability distribution at time t. It's entropy is defined by: 

S{Pt) = - ^ Pt{y) log(Pt(y)) , (1.12) 
y 

where the sum runs on all the lattice sites y. 

If our searcher is on one of the site of the lattice, it is now possible to compute the expected 
variation of entropy of the posterior distribution described above, resulting from a move on 
one of the adjacent lattice sites x: 

{S{Pt+At) - S{Pt)) = -Pt{x)S{Pt) + (1 - Pt{x)) I^A^norm + J] PiAS^ , (1.13) 

where the expected value has been taken with respect to the posterior probability distribution 
at time t. 

Let us analyze the terms one by one: 

—Pt{x)S{Pt) The source is found to be in x and the entropy vanishes. The probability for 
this event to happen is given by the posterior Pt{x) and the new value of the entropy is 
zero, that is the variation is —S{Pt). 

(1 — Pt{x)) ASnorm The source is not found, the probability of it being at site x is now zero 
and the whole probability distribution has to be normalized. It can be easily computed 
as: 



^ {Ptix)SiPt) - 5b(Pt(x)) , 



(1.14) 



1-Pt{x) 

where -S'b(p) = —plog{p) — {1 — p) log(l — p) is the binary entropy function. 



(1 — Pt{x)) Yli PiASi The source is not found, but at site x the searcher receives i hits, pi is 
the probability of receiving i hits and ASi is the corresponding entropy variation, that 
can be calculated remembering that: 



^•^ E(y — xVe'"^*'^^^/-^:) 
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The main idea behind Infotaxis is to use this variation of entropy as an instantaneous po- 
tential and to move in the direction where the entropy decreases faster. With this in mind 
different terms play the contrasting roles of exploration and exploitation in the search. The 
first term is more negative when the probability of finding the source at site x is larger and 
can be thought as an exploitation term, where the searcher tries to move greedily where the 
source is more likely to be found. This term only dominates at the end of the search when 
the probability is well concentrated. 

The last two terms favor the collection of new information, through, on one hand, the elimi- 
nation of possible candidates for the source position, and, on the other, the collection of hits. 
One of the most compelling features of this algorithm is that the balance between exploration 
and exploitation seems to be automatic, we will sec in the following that this statement needs 
to be refined, and that one can see the algorithm as greedy on the entropy potential and that 
a class of more powerful algorithms can be imagined on the basis of this. 
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Continuous infotaxis 



2.1 Derivation of continuous infotaxis 

We now turn to the problem of the derivation of a continuous form for infotaxis which we 
have done during our PhD. There are a few reasons for doing so: first of aU, real organisms 
experience the world as continuous and a lattice based description of the world seems very 
artificial. 

Secondly, the original algorithm poses a very realistic odor propagation model, while retaining 
a discrete description of the searcher and of its vision of the world. This makes the model 
anisotropic, in fact if we suppose the source is at a certain euclidean distance, the searcher 
will experience the same number of hits (on average) regardless of the direction of the source 
with respect to the axes of the lattice, but the direction of the source might decrease the 
number of steps needed to reach it of a factor of up to ^fd, where d is the dimension of the 
space. 

Another inconvenient of a discrete model is that the lattice is finite and the time needed to 
sum over all of its sites limits what can be practically done, especially in three dimensions 
where only a few trajectories on a small lattice were generated [Masson 09j . 
A continuous description on the contrary allows the description of unbounded domains and 
the use of adaptive techniques to improve precision if needed. 

One important thing must be stated before we begin: there is not one possible translation of 
infotaxis in the continuous limit, what we will do is only one of the many options. 
In the following we will derive our version of continuous infotaxis in two different ways: the 
first is somewhat lengthy and cumbersome, but it follows closely from the discrete definition, 
while the second is much more compact but we think showing both might shine different lights 
on the problem. 

The first difference between a discrete and a continuous model is the nature of the proba- 
bilistic description: from now on we have to distinguish between probabilities and probability 
densities which we will denote with pt{x). In order to complete the discussion of the con- 
tinuous limit we have to identify three independent scales which are relevant in the spatial 
part of the limit which are identical in the discrete version of the algorithm. These are: the 
lattice spacing, the size of the source Us and the area (or volume) perceived by the searcher 
in a time-step cTp. 

To rephrase this: in the discrete version of the algorithm during one time step the searcher 
is able to rule out the presence of the search on one lattice site. The source size is one lattice 
site. Performing the continuous limit we could, in principle, leave the size of the source and 
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of the searcher perceptions finite for a vanishing lattice spacing. 



If we analyze one by one the terms of equation (1.13) we obtain: 



—Pt{x)S{Pt) While dealing with discrete probabilities the entropy of a sure event is zero, 
on the other hand for continuous distributions the entropy of a Dirac distribution is 
negative and divergent. In this case if the source is found the entropy does not diverge 
because the source has a finite size o"s. Therefore this term is o"pPt(x)(log(cJs) — S{pt)) 

[I - Ptix))AS 

norm When the searcher moves the probability in the area Up around its po- 
sition x{t) becomes zero, thus the expected value for the variation of entropy due to 
the new normalization reads Pt[x)apS{pt) — S\^{pt{x)ap), where we have considered pt 
constant in the area dp and where Sh{p) is the binary entropy, as function defined in 
the previous chapter. 

(1 — Pt{x)) Pi^Si For what concerns the terms depending on the expected number of hits, 
we will focus on none or a single hit in a time At, because the probability of having more 
is negligible when At is small. That is: pi = At{R{y — x{t))) + O(At^) and po = I — pi. 
Thanks to the definition of the posterior we can write down the probability density at 
time t + At if an hit has occurred in the interval At as: 



or if it hasn't occurred: 



p'+^'^y^-p'^yh-At{Riz-xm 

= pt{y)[l + At{{R{z - xm - R{y - x(t)))] + 0{At' 



(2.2) 



where we have omitted the vector norms in the argument of the R and where the average 
is performed over the variable z. Notice that we only need the zeroth order in At for 
the term for one hit. 

Omitting all dependencies, the entropy variation for no hits reads: 



PoASo = (1 - At{R)) [S U^,} - S{p-, 



[l-At{R)){At{m-R)^og{pt))) (2-3) 
-At{{{R)-R)logipt)) , 



while that for one hit is: 



PiASi = At{R) i^S (pI'_^^,) - S{pt) 

= At ^i?log [pt^^ + (R) log(pt)^ . 
Putting all the terms together one obtains: 

{S{pt+At) - S{pt)) = CTpPtix) log((Ts) + 5b(pf(x)fTp) 
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at first order in At. 

When the size of the area observed by the searcher vanishes all the terms on the first line 
vanish (even if the area of the source is zero). One must also observe that in the continuous 
limit the area ap must be written as svAt where s is the cross section of the searcher's 
perception and v its speed. 

On the other hand we can regard the number of received hits as a message on the position 
of the source. We can compute the mutual information between the random variable Y, the 
position of the source and the random variable A^, number of hits at first order in At. 
If one remember the meaning of the rate function R, P{N = 1\Y = y) = AtR{y — x) + o{At) 
and conversely P{N = 0\Y = y) = 1 — AtR{y — x) + o{At), it follows that: 



where all the terms except for A^ = 1 are of higher order in At. 

The main idea behind discrete infotaxis, that is: to move in the direction that minimizes 
the entropy of the posterior distribution, here translates into moving in the direction that 
maximizes the mutual information between the two variables. 

One of the possible strategies to move in the direction that maximizes the gain in information, 
and arguably the simplest is that of forcing the searcher to obey Brownian dynamics, where 
the opposite of the information gain is viewed as a potential to be minimized, that is: 



where 7 is a friction coefficient that will be considered constant. 

It can be argued that this equation cannot be considered equivalent to infotaxis, because the 
velocity is not constant. We have discussed this in detail in [ Barbieri llj . and we will not 
dwell upon the details here. 

It suffices to say that there is no way to impose a fixed velocity in a continuous framework: 
suppose for example that we choose 7 as a function of the right hand side so that the velocity 
is equal to a constant V, we have observed that if we choose too big a ^ we observe long steps 
and a lot of backtracking. This is clearly an effect of the finite integration time-step and it is 
an effect that disappears in the small time-step limit, but we believe it is symptomatic of a 
system that chooses it's own velocity by changing the direction continuously. 

2.2 Search strategy before the first hit 
2.2.1 Choice of the prior 

Bayesian techniques are usually very powerful, but the choice of a suitable prior can often be 
difficult. One can hope for the existence of a obvious choice, or that the results do not depend 




(2.6) 




(2.7) 



And for the searcher: 
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too much on the specifics of the prior. 

The situation at hand is less clear cut: while all of our quantities have a clear probabilistic 
interpretation, what we ultimately want is for the algorithm to be performing well. 
Vergassola et al. chose a prior proportional to the odor propagation function R which has a 
few desirable properties: it is normalizable, it has a possible interpretation in the framework 
of our model and it does not define a new, arbitrary length scale while still concentrating 
most of the probability over a finite area. 

Another possible choice in the discrete version of the algorithm is the uniform distribution, 
where every lattice site is given equal weight, even though this was not included in the original 
infotaxis paper, we have toyed with this prior only to obtain trajectories that go straight until 
they reach a distance of approximately A from the boundary of the lattice. 
Unfortunately, this lattice choice does not have any equivalent in unbounded continuous space, 
because of this we cannot translate our results in this case. 
We will now concentrate on two priors: 

One-hit prior Proportional to the right R in the appropriate dimension. It has an inte- 
grable divergence at the origin, but for every r, no matter how small R{y) exp(— ri?(y)) 
is finite for y = 0. 



Exponential prior Proportional to exp(— y/do)- Choosing = A we have the same asymp- 
totic behavior for large y. This can be used to investigate how important the small scale 
behavior of the prior is. 



In his original paper [Vergassola 07b[ [Vergassola 07a| Vergassola et al. proposed the first prior 
as a natural choice. 

As suggested by the name we have chosen, we could consider the one-hit prior as the result 
of a search process that has started just after the searcher has received the first hit. 
This of interpretation, however, poses some problems: how can we justify search trajectories 
that start very far from the source? If we stick to this interpretation they should be considered 
as very rare events. 

This can be salvaged by considering only trajectories that start close enough to the source. 
As we will see in the following, it doesn't make much sense to employ such a sophisticated 
algorithm when there's effectively no information to gain. 



2.2.2 Spirals 

In [Vergassola 07b| Vergassola and collaborators described logarithmic spirals in discrete in- 
fotaxis, before the first hit. After observing several trajectories where the source of odor had 
been turned off, we have concluded jBarbieri 07] that spirals do appear in discrete infotaxis, 
but they are not logarithmic, but Archimedean in nature. That is the spacing between sub- 
sequent arms is constant. 

In what follows we wish to characterize spirals in two dimensions and their equivalent in 
three dimensions for continuous infotaxis, the debate over discrete infotaxis having since been 
settled jMasson 09[ with further simulations in hexagonal lattices. 
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One-hit prior 

In two dimensions the searcher moves in spirals for a wide range of values of 7, as is shown 



in figure 2.1 When 7 is too big spiral behavior breaks down. 

This behavior can be explained by a very simple argument: for a large range of values of 7 
the searcher effectively visits a region of area proportional to the elapsed time. In a way the 
probability of finding the source in a given area is discounted in a given time thanks to the 
negative exponential term in the posterior. Once the source is not found the searcher moves 



elsewhere. This effect on the prior can be directly observed in figure 2.2 
This area does not depend on 7, while the linear velocity of the searcher does. For this reason 
this only has an effect on the spacing of the arms. More quantitatively if b is the spacing 
between successive arms then what we observe is consistent with b ~ ^Jj and |x| ~ 1/5. 
Spiral behavior is not observed for large 7 (> 0.08), we think that this is due to the fact 
that the RlogR kernel has a range which is proportional to A and for large 7's we would 
expect arm spacing which are larger than this range. In other words the algorithm cannot be 
sensitive to the probability distribution at large distances. 

To validate this hypothesis we have run a few simulation with a modified kernel with larger 
and shorter range, and we have indeed observed that this moves the spiral-breaking-down 
threshold in the expected direction. 

In three dimensions there is no exact equivalent of a spiral: the searcher will try to stay 
as close as possible to where it started as a result of the exponentially decreasing prior, but 
will move in a self avoiding trajectory, because of the term exp ^— J* dt'R(y — x{t')^ in the 
posterior probability. 

We have observed the first part of the trajectory to be quasi-planar and then to break off and 



start occupying all available space, this is shown in figure 2.3 where the dependence of the 
distance from the origin is plotted as a function of time and compared with the curve t^^^ 
which corresponds to the prediction of space filling trajectories. 

Three dimensional trajectories look like balls of yarn, compact coiled structures. We think 
that parallels can be drawn with the solutions of the Thomson problem for polyelectrolites 



[Angelescu 08 , ICerda 05l ISlosar 06] , which has received a lot of attention recently because of 



its connections to the problem of DNA packing in virus capsides. 



Exponential prior 

Another way of interpreting the choice of the one- hit prior is to consider the details of the prior 
at short range from the starting point of the searcher as mostly irrelevant and to concentrate 
on the asymptotic behavior. 

Ignoring small scale behavior makes a lot of sense in the case of discrete infotaxis, where the 
scales smaller than the lattice spacing are not accessible, and the probability at the starting 
point of the searcher is exactly zero regardless of the prior. 

The exponential prior can be also justified because of its memorylessness property that is: 
P{Y > y-\-d\Y > y) = P{Y > d) and furthermore because it is maximum entropy distribution 
with a fixed mean. 

This two mathematical properties could be used to justify the Archimedean nature of the 



spirals, which can be checked in figure 2.4 The spirals however break down, as discussed 
before for the case of variable 7, when the arm spacing b would exceed the range of the kernel 
RlogR. 



2.2. SEARCH STRATEGY BEFORE THE FIRST HIT 



23 



CHAPTER 2. CONTINUOUS INFOTAXIS 




a function of time. As this quantity is proportional to the area explored in a given time, we 
show here that this is proportional to the elapsed time for several values of 7. For large 7 this 
behavior breaks down and the searcher eventually halts. The trajectories for 7 = 0.01,0.1 
correspond to panels A and D. C. Many values of the spacing b between spiral arms, as a 
function of 7. The dotted line corresponds to a slope of —1/2. 
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Figure 2.3: A three-dimensional trajectory in the absence of hit for 7 = .01 (left), with its 
quasi two-dimensional initial portion (top); the time axis is color coded. Bottom: distance to 
the origin, d{t), compared to the power laws t''^^, then t^^^. 



The fact that the behavior of the searcher for both the one-hit prior and the exponential prior 
produces spirals, suggests that the spirals are a consequence of the asymptotic behavior of 
the prior at large distances. We will try to verify this with a Taylor series expansion of the 
right hand side of the equation for the movement of the searcher. 



2.2.3 Small x expansion 

It is possible to characterize the spirals as an instability by performing an expansion for small 



X of equation 2.8 



7f(t) =ai{t)x{t) + a2{t) [ dt'x{t') 

Jo 



+ / dt'x{t') 
Jo 

+ [ dt'x{t') 
Jo 



^l{t)\x{t')f + ^2{t)x{t') ■ / dt"x{t" 

Jo 

f3sit) dt"\xit")\^ + I3^{t) (^j'^dt"x{t") 



(2.9) 



where the ai{t) and /3j(t) are time dependent coefficients, respectively for the first and third 
degree. All other terms vanish for symmetry reasons. We need to stress that this expansion 
is only valid for three dimensions. 
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Figure 2.4: Three trajectories without hits for the exponential prior with varying do- As 
highhghted in the text, for low enough do spirals are observed with spacing b oc do. When 
do is too large, as we have observed for varying 7, spirals break down. Such is the case for 
do = 3. 
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Defining: 



J dy exp{-tR{y)) 



we can express the terms of the development as: 

\2 



{iJ(9)), 



1 (i?(y)(i?"(y) + 2™);^ 
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1 1 



2\ 



^ <(i^'(.))^>. {iR\y)?^o, (^))^ 1 ((i.'(.))^iog ^^^^^^^^^^ ^ ^ 



(2.16) 



If one looks at the equation up to the first order, neglecting the /3 terms, one can already 
explain the instability that leads to spirals. 

Since ai ~ > and 02 — —^^^-^ < for large t. ai is positive so the trajectory 

starts as a straight line out of the origin, but then the term a2 which is unstable makes it 
unstable against local bending explaining planar spirals. 

An analytic solution of this simplified equation is possible if one approximates the coefficients 
neglecting the logarithmic terms. 

Ps and ^4 are coefficients to terms that lie in the same plane as the first order ones. Because of 
this we will only concentrate on /3i and ^2- Those are both positive and lead to the instability 
of the planar trajectory eventually leading to a full fiedged three dimensional structure. 



2.2.4 Waiting time 

One interesting feature of the spirals is that they do not start immediately as in the discrete 
algorithm. This seems to be at odds with the results obtained in previous section: ai is 
always positive, this means that staying in the origin without moving should be unstable. 
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How to reconcile this apparent paradox? 
Let us define: 



K(x) = -(i?(y-x)log(^|||-||y)) , (2.17) 



where we have sticked to the definition of the brackets of equation (2.10) in the previous 
section. 

If we plot Vr{x) along a direction for different values of r and we compute its minimum, as in 
figure [23] we find indeed that there always a maximum in x = 0, but there is also a non-trivial 
minimum for every r > 0, albeit this minimum can be very close to the origin for small r. 
The curve of the minumum XmiT) = argmin^,. Vr{x) is well fit by as Xm{T) ~ 6.62 exp(— 2.32/t) 
in two dimensions. 

These results can be further substantiated by convolving the R with a Gaussian distribution 

of width a, this way a cr-dependent crossover in r can be shown to exist between waiting 

and moving. The interpretation of the Gaussian convolution is that, because of the numerical 

integration, when the searcher is waiting it effectively fluctuates around the origin. 

All this can be summarized by saying that, if the noise is zero or stays within an acceptable 

range, the time it takes the searcher to move perceptibly out of the origin is ~ 0.4. 

This is a very important feature of the continuous version of infotaxis which is not present in 

its discrete counterpart. This is due to the fact that setting a whole lattice site probability 

to zero creates a very strong repulsive effect, and since the area that is set to zero in the 

continuous version is infinitesimal there is no inhibition of this effect. 

The striking feature of this effect is that it reproduces itself whenever there is a new hit: the 
searcher stops, waits about 0.4 and then starts moving again. We can think of it as if it were 
trying to exclude that the source was in its immediate vicinity. 

There exists a distance from the source when the expected arrival time of two successive hits 
is smaller than the waiting time, when this happens the searcher will be effectively stuck at 
this position. We will call this distance dhait^ it is dimension dependent. It is ~ 0.1 for D=2 
and ~ 0.3 for D=3. 

dhait is more rigorously defined as 0.4i2((ihait) = 1- The reason for different values for different 
dimensions is the different form of R. 
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Figure 2.5: Profile of Vix) at varying r and position of its minimum. We have subtracted 
arbitrary constants to the various Vix) in order for the curve of the minima to pass through 
the minima. 
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2.3 Numerical integration 

In this chapter we wih illustrate the techniques we have employed for numerically integrating 
the continuous infotaxis equations. We will devote some time to justifying the choice of a 
technique that increases the complexity of the algorithm in favor of precision. 
At every time-step we have to compute the integral of the kernel over the probability measure 
in order to know the velocity of the searcher. The position of the searcher is then updated 
with a simple Euler integration step, that is: 

x{t + At) = x{t) + Atv{t) , (2.18) 

where v{t) is the velocity at time t defined as —'VxVt{x)/^. 

We have found empirically that a good choice for the integration time-step At = 7, this choice 
ensures precision when 7 is small and then the searcher is fast and economy when 7 is big 
and the searcher is slow. 

At each time step a Poisson pseudo-random variable is generated for the number of hits, this 

is recorded in a vector as is the whole trajectory. 

The whole procedure can be summarized in pseudo-code as: 

searcher=origin 
source=d_0/sqrt (dimension) 
i=0 

while (d_success<distance (source , searcher) <d_f ail) { 
old_n_hits=n_hits ; 

n_hits+=poissonrandoni(dt*R(distance (source , searcher) ) ) 
for ( j =old_n_hits ; j <n_hits ; j ++) 

hits [j] =searcher 
f orce=average (f orce ,RjR_prime ,x, trajectory ,history,hits) 
x+=f or ce*dt /gamma 
traj ectory [i] =searcher 
i++ 

> 

An important detail that can't be omitted is the calculation of the averages over the probability 
distribution. The original discrete infotaxis implementation performed this by storing and 
updating the complete probability distribution over the lattice. This is clearly impossible in 
the absence of a lattice. Especially since the search is performed in unbounded Euclidean 

space. 

We have, however, tried memorizing the probability distribution at points either on a non- 
square lattice or randomly picked in order to emulate the behavior of the original algorithm. 
This approach is plagued by various serious shortcomings: first of all we need to choose the 
points at the beginning of the search, and it is natural to choose them concentrated around 
the starting position. After a certain time, however, the searcher will have moved farther 
away where the points are rarer and numerical precision will start suffering. 
Another big problem is that the computation of integrals as sums over a set of point that 
does not change will effectively recreate a lattice, albeit not a regular one. The trajectories 
will stick to those lattice points because visiting them directly is optimal for the information 
gain. 

In order to avoid these artifacts, that crippled the simulation even for relatively short run 



2.3. NUMERICAL INTEGRATION 



31 



CHAPTER 2. CONTINUOUS INFOTAXIS 



times, we have decided not to store and update the whole probabihty distribution, but to 
store the trajectory and the hits and to calculate the probability distribution dynamically at 
each time-step. It is now possible to perform the integrals by Montecarlo importance sampling 
around the position of the searcher, and choose a different set of points at each time-step. 
The procedure is as follows: one performs a change of variable for the argument u = \x{t) — y\ 
of the functions to integrate, u = (^{v) = «o(l — v)/v, where v G (0,1], then angles are 
sampled uniformly in two or three dimensions. 

-^MC points are sampled this way (typically 10^) for each time step, and summed taking care 
of the Jacobian of the change of variables v u. 
Again in pseudo-code: 

f unct ion average (functional , R , R_pr ime , x , traj ectory , history , hits) { 

sum=0 

for (i=0; i<MC_steps; i++) { 
y . angle=randoinangle () 
y . radius=phi (randomreal () ) 

jacobian=phi_primep(inverse_phi (point (radius) ) 

hitscontrib=l 

for(j=0; j<size(hits) ; j++) 

hitscontrib*=R(distaiice(y ,hits [j] ) ) 

if (dimension==3) jacobiaii*=rs*rs 

else jacobian*=rs 

suin+= j acobian*priorprob (y) *exp (-history (R , y , traj ectory) ) 
*f unct ional (y , x , R , Rp) 

} 

return sum/MC_ steps 

> 

The only bit left is the computation of the integral over the trajectory at the exponential: 

/ dt' R{y - x{t')) . (2.19) 
Jo 



To compute this we have used the classic composite Simpson's rule: 

(2.20) 



/ mat' 

J o 



At 
~3 



n/2-l n/2 

/(0) + 2 J2 /(2iAt)+45^/((2j-l)At)) + /(t) 
j=i j=i 



where n = t/At needs to be even. 

Taking extra care to ensure n is even, we get in pseudo-code: 

f unct ion hi story (R , x , traj ectory) { 

sum=0 

if (size(trajectory)==l) 

return dt*R(distance (traj ectory [0] ,x)) 
if (size(trajectory)==2) 

return dt*(R(distance(trajectory[0] ,x) )+R(distance(trajectory [1] ,x))) 
f lag=size (traj ectory) 7^2 

suin=R (distance (traj ectory [flag] , x) ) +R(distance (traj ectory [size-1] , x) ) 



32 



2.3. NUMERICAL INTEGRATION 



CHAPTER 2. CONTINUOUS INFOTAXIS 



for (i=l; i<=size/2-l; i++) 

sum+=2*R (distance (traj ectory [2*i+f lag] ,x) ) 
for (i=l; i<=size/2; i++) 

suni+=4*R( distance (trajectory [2*i-l+f lag] ,x) ) 
sum*=dt/3 ; 
return sum; 



2.4 Results and performances 
2.4.1 Typical trajectories 

In this section we wish to show what the typical trajectories of continuous infotaxis look like 
in two and three dimensions once we have introduced a source of odor, as the goal for the 
searcher. 

In two dimensions one can superimpose the trajectory to the probability and gain some good 
insights as to how the posterior probability is affected by odor hits. 

In figure [2^ one can see the searcher starts its trajectory spiraling around its starting position, 
and how the probability distribution is affected by this: the maximum of the probability is 
always in front of the searcher, and a valley of minima is dug where it has passed. 
In the third panel (bottom left) the first hit is received and the probability has a new maxi- 
mum. If the searcher didn't receive further hits in the last panel (bottom right) it would start 
spiraling around the position of the new maximum. 

In the last panel the probability distribution is very peaked around the real position of the 
source, which is about to be found. 



In figure 2.7 two trajectories are shown for two-dimensional infotaxis: the one on the left is 
successful in finding the source while the second is not. 

Notice how the unsuccessful searcher has received a very misleading hit, actually farther away 
from the source than when it started. We can imagine the probability distribution to be 
peaked somewhere closer to the position of the hit. This maximum becomes the center of its 
new spiraling, albeit these new spirals are not as regular as the ones we have observed without 
hits. 



Trajectories with hits are much harder to visualize, we try to do so in figure 2.8 but the 
trajectory covers itself. What can be gleaned from these two trajectories is that the searcher 
seems to be using less information than in two-dimensional searches. In fact the unsuccessful 
searcher receives no hit at all while the successful one received only two. 



2.4.2 Average signal 

We now wish to define what we think will be a very useful tool for the evaluation of perfor- 
mances: as we will see in the following, a large number of runs are needed in order to sample 
the probability of success and the time of success. This is due to the fact that the arrival 
times and positions of hits can vary wildly, and have a very strong influence on the searcher 
trajectory. 

If one observes the posterior probability density, one notices that the hits are encoded as the 
product of R functions centered at the position of each hit. As it is customary with multi- 
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Figure 2.6: The trajectory is plotted in red, superposed to the posterior probabiHty distri- 
bution. Along the trajectory hits are displayed as yellow spheres. The source is the blue 
cube. 
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Figure 2.7: Examples of search trajectories with hits two dimensions (7 = .02). The trajectory 
on the left finds the source, while the one on the right is not successful. The initial distance to 
the source is do = 2. The red disk represents points at distance < (ihait to the source. Black 
squares locate the hits. 
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Figure 2.8: Examples of search trajectories with hits three dimensions (7 = .01). The tra- 
jectory on the left finds the source, while the one on the right is not successful. The initial 
distance to the source is do = 2. The red sphere represents points at distance < dhait to the 
source. Black cubes locate the hits. 



plicative processes it is natural to look at the logarithm of the probability distribution. 



log Pt{y) 



f dt'R{y 
Jo 



H 

x{t')) + log R{y - xiU)) + const ; 

i=l 



(2.211 



where ti are the times at which the hits occur. 

If the searcher is at time t in position t the probability it will get a hit in the next At is given 
by AtR(y* — x(t)) where y* is the actual position of the source. Having observed this, we can 



take the expected value of equation (2.21) with respect to the probability of receiving a hit 
at each time-step. 
This yields: 

log Pt{y) = - [ dt' [R{y - x{t')) + R{y* - x{t')) log R{y - x{t'))\ + const, . (2.22) 
Jo 

If we now use the exponential of this newly defined quantity as the probability distribution 
that moves the searcher we obtain trajectories that have features that resemble closely those 
of trajectories with truly random hits. 

However, even if we have reduced greatly the variability among trajectories, numerical tra- 
jectories obtained for this average signal are not completely deterministic. This is due to the 
stochastic errors involved in Montecarlo integration and how those play an important role in 
the initial breaking of rotational symmetry. 

In other words, the searcher starts in a random direction which defines the phase of the turn- 
ings of the spiral. This random direction is not a feature of the Poisson noise of the hits, but 
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of the noise coming from Montecarlo integration. If we had access to a perfect integrator, we 
would need to add noise artificially at least at an initial stage to start the search. 
In figure 2.9 we compare a trajectory with random hits to a trajectory obtained with the av- 
erage signal when those have comparable duration. We also plot the entropy of the posterior 
distribution. Notice how it plunges in discontinuous jumps for the random signal and how it 
tapers off gently for the average signal. 




Figure 2.9: Entropy S{t) (bottom, left scale) for one trajectory x(t) obtained with random 
hits (top left, full curve, 3 hits are received) and the average trajectory (top right, dashed 
curve) . The dotted line shows the average number of hits Nh (right scale) received along the 
average trajectory. The source is located in {V^, \/2) (circle). 



2.4.3 Performances 

In order to evaluate the performance of the algorithm we have to look at the success prob- 
ability and the time needed to reach the source of odor in case of a success. But first of all 
we have to give a clear definition of success and failure. This is at odds with the discrete 
algorithm, where success was obtained when the searcher and the source were at the same 
position and failure when the searcher wandered out of the lattice. 

In a continuous, unbounded space these definitions do not apply. However we can define a 
radius dfaii ^ 1 from the source that defines the region of space out of which the search has 



2.4. RESULTS AND PERFORMANCES 



37 



CHAPTER 2. CONTINUOUS INFOTAXIS 



o 
o 

O 0.8- 



^ 0.6- 
o 



D = 2, Y= .02 
I I I 



I 1 

D = 3,y^ .01 



1 2 3 

do 



0.8 



0.6 



D = 2,dg = 2 



1 : 

1- H- 


1 


H I-H : 





£) = 3, = 2 



.005 .01 .02 .04 

y 



Figure 2.10: Probability of success of Infotaxis as a function of the initial distance to the 
source, do (left), and of the friction 7 (right). Top points correspond to D = 2 dimensions, 
bottom points to -D = 3. The numbers of runs is of about 200 for each point. All probabilities 
where obtained with dfaii = 8. 



not much hope of ever succeeding. The bigger dfaii, the less our results will depend on it. 
The definition of a dsuccess is a bit more delicate since too small a radius would have catas- 
trophic effects because of the pinning phenomenon we have described in the previous section; 
too big a radius would mean getting a lot of false positives and overestimating the perfor- 
mance of the algorithm. In the end we settled for ^success = t^hait- 

There are two parameters that need to be varied in order to evaluate performance: one is the 

distance from the source, the other is 7 that characterizes the dynamics. 

Another delicate issue is the definition of time: since our algorithm has a complexity per 

time-step which is linear in the elapsed time, CPU time will not be proportional to simulation 

time and we would need to optimize one or the other in different scenarios. 

We have investigated the success probability for different values of the initial distance between 

the searcher and the source. 

We have chosen distances between 1 and 3 in units of A, because, on one hand, larger initial 
distances would correspond to vanishing an exponentially vanishing probability of receiving 
one hit and would only lengthen the spirals without showing any interesting feature of the 
algorithm. 

On the other hand distances smaller than 1 are too close with the halting distance especially 
in three dimensions. Because of these two arguments we believe this is the only region where 
the behavior of this algorithm might be non-trivial. 

Another important parameter is the friction coefficient 7. Overall we have observed that the 
success probability is affected by neither the starting distance or the friction coefficient. It is 
compatible with unity in two dimensions and slightly higher than 80% in three dimensions. 



The results are detailed in figure 2.10 



This does not surprise us much: searches are easier in two dimensions, where random walks 
are space filling. The result in three dimensions looks promising and it is much better than 
any random estimate. The interested reader can refer to the classic reference by Redner 
[Redner 'Oi\ for a computation of the probabilities for the associated random phenomena. 
Let us now define the relevant quantities for the search time: first of all we will restrict our- 
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Figure 2.11: A. histograms of the search times ts in = 2 (7 = .02) and D = 3 (7 = .01) 
dimensions for an initial distance do = 2 to the source. Full histograms correspond to the 
average trajectories, contour histograms to trajectories with random hits. B. Average search 
time ts as a function of 7. C. total CPU time as a function of 7, calculated as (ts/At)^. 



selves to the successful cases. We define the success time tg as the time when the algorithm 
halts because the searcher has entered the disk of radius ^success- 

The CPU time can be defined in a implementation-agnostic form as (tg/At)^ since it will be 
generally proportional to this quantity. It should be noted that in the current implementation, 
with Iff^ Monte Carlo sampling points in spatial integrations and on a 2.4 GHz core of an 
Intel Core 2, ^ ~ 3 ms. 

In figure 2.11 we show, with the tg's and the CPU times for different 7's, an histogram compar- 
ing the results obtained with the average equation of the previous section with those obtained 
with the non-simplified equation. 

It is interesting to note how if one takes into account only the tg the algorithm is most efficient 
at low 7, however, since lower 7 call for lower At in CPU time the algorithm is much faster 
for high 7. 

This can be explained by remembering the dependence of the spiral spacing on 7: low 7 means 
tighter spirals and a searcher that moves much faster linearly: while this behavior turns out 
to be more effective at exploring the space it is more computationally intensive because the 
increased scalar velocity calls for a smaller time-step. 

Overall we think performance can be greatly increased either by reducing the number of Monte 
Carlo integration points or by reducing the number of points in the time integral. 
A reduction of the time points stored in memory can be obtained in two ways: the first is to 
add a finite memory, but if one is not careful one could end up with the searcher very strongly 
attracted back to the origin after a certain time, because the divergence of the prior is not 
attenuated by the trajectory anymore. 

A smarter option would be to add some sort of coarse graining in time: points become much 
rarer in the distant past, but they have an increasing weight in the discrete sum at the ex- 
ponent in the posterior probability. We would probably lose some precision this way, but we 
could recover a linear-time algorithm. 
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DNA unzipping and sequencing 
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Chapter 3 



Review of current sequencing 
technologies and their limitations 



In this part we wish to show how micromanipulation experiments on DNA molecules could 
be exploited to give us better sequencing techniques. 

In this chapter we will describe current sequencing technologies, then underline what are their 
current limitation and what is to be gained from single-molecule sequencing. This will be the 
basis and motivation for our further work. 

Modern DNA sequencing was developed in the second half of the seventies by Sanger et 
al. [Sanger 75 Sanger 77] , a few other methods were tried in the first part of the decade 



[Maxam 77] . but since they do not have modern day equivalents we will not discuss them 
here 



3.1 Chain-termination method 

The method developed by Sanger is based on the properties of dideoxynucleotides (ddNTPs): 
these are modified nucleotides: where normal nucleotides would be deoxynucleotides (dNTPs) 



these lack the 3' hydroxyl group on their dexyribose sugar (see figure 3.1), this means that 
once they are added to a growing strand of DNA, no further nucleotide can be added because 
they lack the ability to bind with it [Atkinson 69] . 

In order to be sequenced DNA needs to be single-stranded and in multiple copies each of which 
has a primer attached to the same point. The copies are then separated in four reactions all 
of which contain DNA polymerase and all four of the dNTP and only one of the ddNTP in a 
lower concentration. 

The DNA polymerase facilitates the binding of the dNTP on the complementary bases, but 
once in a while a ddNTP will bind to the chain halting the process. At the end of the process 
we are left with different pieces of DNA all starting at the same point (where the primer was 
bound) and ending at random points, with the constraint that all the pieces in the reaction 
that contained only ddATP end at a T basis, all those in the ddCTP reaction end at a G 
basis and so on and so forth. 

Now the molecules can be sorted according to their size with gel electrophoresis and pho- 
tographed on four different lanes (one for each of the basis), a black line will appear in 
correspondence to each base. 

Several variation to this technique exist: the ddNTP can be dyed in order for them to fluo- 
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Figure 3.1: Right: a normal Nucleotide Triphosphate where the sugar is a 2'-deoxyribosine. 
Left: a NTP where the sugar is a 2',3'-dideoxyribosine. The absence of the hydroxide on the 
3' carbon atom means it no further nucleotide can link to it. 
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Figure 3.2: One of the figures of Sanger's seminal paper Sanger 77] showing an autoradiograph 
of the four lanes of chain-termination sequencing and how they are used for sequencing. 



resce or tagged with a radioactive substance, but the essential mechanism stays the same. 
The main problem with this kind of method is that the quality of the sequencing traces de- 
grade after about 1000 bp. This is due to several factors: the first and most important is 
the nature of the random process involved in the binding of ddNTP. Suppose we are in the 
ddATP solution and the next base is a T, then the probability pdd of the ddATP binding 
instead of the dATP binding does not depend on the length of the sequence. On the other 
hand the probability of still finding a sequence of a certain length after having encountered n 
T's is (1 — Pdd)" and thus decreases exponentially. 

Another source of accumulating errors is the presence of two or more basis of the same kind 
next to each other, that is to say it is difficult to distinguish four C's in a row from five C's. 
This type of errors will crop up, making the alignment of the four different lanes difficult. 



3.2 Pyrosequencing 

Another very popular sequencing technique which is behind some current day automated 
sequencing methods is pyrosequencing. Developed by Ronaghi and Nyrem in the nineties 
fRonaghi 96 Ronaghi 98 , pyrosequencing relies on detecting the activity of DNA polymerase 



through the use of a chemiluminescent enzyme that will emit light whenever a new bond is 
formed. 

A single strand of DNA reacts with DNA polymerase, a chemiluminescent enzyme and so- 
lutions of one of the four nucleotides, which are sequentially added and removed. When a 
nucleotide binds to the next available spot, light is emitted and we know which base has 
bound because only one type of nucleotide was in solution at that moment. 
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Pyrosequencing is inherently limited to sequences of about 500 bp (more typically less than 
100 bp), but it is well suited to being automated and massively parallelized. Because of the 
limitations in the size of the the fragments it has been rarely used for de novo sequencing, 
instead it is either used in conjunction with other methods, or for resequencing and for the 
search for single nucleotide polymorphisms (SNP). Only recently read lengths of about 1000 
bp have been attained by a company called 454. This will allow for de novo sequencing using 
pyrosequencing . 



3.3 Sequencing by ligation 

Ligation is the joining of two double stranded DNA segments through the formation of two 
covalent bonds. This reaction involves an enzyme called DNA ligase. The difference between 
DNA ligase and DNA polymerase is that DNA polymerase needs one of the two strands to 
be intact while DNA ligase can repair double stranded DNA. 

DNA ligase can also be used to join a single strand of DNA to an otherwise intact single 
strand, but in this case it is very sensitive to mismatches, that is it will hardly ever join two 
strands which are not complementary. 

Several techniques are based on this specificity, namely ligase chain reaction (LCR) [Barany "OT 
IWiedmann 94j and ligase amplification reaction (LAR) |Wu 89] . we will not dwell here on the 
details, it suffices to know that these rely on oligonucleotides (short pieces of ssDNA, here 
typically 8-9 bases long) and their ligation to a the DNA that is being sequenced. 
A number of different oligonucleotides is added to the solution where the anchor sequence is. 
Then the ligase will hybridize two of the bases of the oligonucleotide to the anchor sequence 
and emit a light signature that allow the two bases to be recognized. 

Sequences are then reconstructed using two-base encoding, a technique that relies on these su- 
perposed two-base reads. Read lengths of up to 25-50 bases have been achieved [McKernan 09] . 



3.4 Limitations 

As you might have noticed, all of the techniques outlined up to here rely on read lengths of 
at most 1000 bp, while whole chromosomes and genomes have lengths that exceed this by 
several orders of magnitude. In order to fill this gap, DNA has to be spliced and amplified 
to be sequenced. Amplification is usually done through a technique called polymerase chain 
reaction (PGR) |Mullis 86[lMullis 94] . 

DNA can be cut in an ordered way starting from one end and then cutting regularly. This 
technique is called chromosome walking and it is the best method for sequences which are 
too long to be sequenced in a go, but still under 10000 bp. The shorter fragments are then 
sequenced leaving 20 or so superposing bases on each fragment to allow for reconstruction. 
Longer sequences as whole chromosomes or genomes are usually dealt with a technique de- 
veloped at the end of the seventies called shotgun sequencing [Staden 79] . 
The name derives from a metaphor: as a shotgun fires a large array of small projectiles in 
a random pattern, DNA is cut in random points into smaller sequences. The process is re- 
peated multiple times as to have several copies of the same sequence cut in different points. 
The spliced sequences can then be sequenced one at a time and then recomposed through the 



use of algorithms that rely on the overlapping between different copies (see figure 3.3). 
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Short reads are fine when we are looking for short mutations such as SNPs or anything shorter 
Target sequence 

AGTGACCATCTGGATTCGGCACGCCCATATGCCCACTTTTTTACATTAGTATTTATGG 



Clones' fragments 




Reconstructed sequence 

AGTGACCATCTGGATTCGGCACGCCCATATGCCCACTTTTTTACATTAGTATTTATGG 
AGTGACCATCTGGATTCGGCACGCCCATATGCCCACTTTTTTACATTAGTATTTATGG 
AGTGACCATCTGGATTCGGCACGCCCATATGCCCACTTTTTTACATTAGTATTTATGG 

Figure 3.3: Shotgun sequencing: the target sequence is cloned several times and cut at random 
points. The smaller segments are then sequenced and the sequence is reconstructed thanks 
to the overlaps. 

than the length of the typical read, but genomes are replete with mutations that are much 
larger in size such as copy number variations (CNV). 

Copy number variations are mutations that involve the deletion or the duplication of a sec- 
tion of DNA, they have lengths of at least 1 kbp and up to several hundred kbp and are very 
common throughout the human genome [Sebat 04} llafrate 04] . 

Copy number variations seem to play a central role in cancer |Shlien lO] , autism [Sebat 07] and 
in neurological conditions [Friedman 06\ IGlessner 10| ISundaram lOj . CNV are very hard to 
find with current sequencing methods, because reconstruction algorithms tend to miss them. 
The only way to effectively indentify them is to use classic sequencing techniques in conjunc- 
tion with microarrays for the detection of SNPs and very complex algorithms [Koike iT] . 
This is one of the main reasons for developing single molecule techniques for sequencing DNA, 
but current efforts are not very promising: zero-mode waveguide [Levene 03] seems to be the 
most advanced but it still offers read lengths of about 1500 bp, that is comparable with chain 
termination techniques. It is a technique based on holes which are small (~ 100 nm) in all of 
their dimensions compared to the frequency of light used for the observation. Their optical 
properties allow the observation of the enzymatic activity of a single molecule. 
On the other hand techniques based on nanopores look promising [ Clarke 09] . Nanopores are 
holes with a diameter of ~ 1 nm, similar to some proteins found on cellular membranes. DNA 
can be forced through the nanopore one base at a time. Since each nucleotide obstructs the 
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nanopore in a different way it is possible to distinguish between nucleotides by measuring the 
electrical properties of the obstructed nanopore. These technique is, however, at a very early 

stage of development. 

This is why in the following we will propose a novel approach based on single-molecule exper- 
iments of unzipping that could one day be used to sequence DNA. 

The reader should keep in mind that no single method is free from the trade-off between 
resolution and scope, that is to say that it is impossible to attain at the same time accuracy 
at a single base level and very long reads. 
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Chapter 4 

Modeling DNA unzipping 



In the past two decades, the development of experimental techniques that allowed the manip- 
ulation of single biological macromolecules at the nm and pN scale has afforded us a wealth 
of experimental data on the physical properties of said molecules. 

At the same time theoretical models have been devised to predict and model the behavior 
of said molecules. In particular the elasticity of both single-stranded and double stranded 
DNA is well know and the phase diagram of dsDNA is well understood. Experiments have 
permitted to denature dsDNA by applying a mechanical force, those experiments have taken 
the name of unzipping because the DNA is pulled apart from its two strands as a zipper (see 



figure 4.1 



These experiments are well understood in their single components: the ds- and ssDNA, the 
fork where the DNA denatures, what was lacking was a clearer picture how the delicate in- 
terplay of these different dynamics. 

After an introduction to the physics of its single components, we will develop a mesoscopic 
model for the coupled dynamics and describe a software package for its simulation. 
The goal here is to see whether the fluctuations an the correlations that compose the dy- 
namics of linkers and beads will affect the unzipping dynamics of the force. This has already 
been investigated in [Manosas 05j , however this approach is novel and has been published in 
[Barbieri 09] . 



4.1 Modeling fork dynamics 

The thermodynamics of DNA pairing is a subject that dates back to before the first sequencing 
techniques were available: a first model was proposed by Tinoco and collaborators in 1971 
[Tinoco TT] , it gave the free energies for the two types of Watson-Crick bonds and it remarked 
that further study was needed to take into account stacking interactions, which had been 
known to be the principal cause of DNA stability for some time then [Crothers 64| . 
In 1973 the same group published a new letter [Tinoco 73] where new data allowed for the 
introduction of stacking effects, that is to say that base-pairing free energies now depended, 
not only on the base itself but on the previous base too. However the results were not very 
precise and they involved RNA hairpins rather than DNA, it wasn't until the second half 
of the eighties that reliable data on DNA thermodynamics became available [Breslauer 86] . 
More recently similar data have been obtained in unzipping experiments. |Huguet 10) . 
The results of all of this studies are that the free energy of a DNA base pair depends on 
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Figure 4.1: Double stranded DNA can be denatured by applying opposite forces to the two 
strands. In clear analogy with the zipper commonly found in clothing, this type of experiment 
has benn christened unzipping. 
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go 


A 


T 


C 


G 


A 


1.78 


1.55 


2.52 


2.22 


T 


1.06 


1.78 


2.28 


2.54 


C 


2.54 


2.22 


3.14 


3.85 


G 


2.28 


2.52 


3.90 


3.14 



Table 4.1: Binding free energies go{bi,bi^i) (units of k^T) obtained from the MFOLD server 
for DNA at room temperature, pH=7.5, and ionic concentration of 0.15 M. The base values 
bi, are given by the line and column respectively. 



the base pair itself and its nearest neighbor nucleotide content, that is if we now consider a 
sequence of bases of dsDNA its free energy will be given by: 



N 



G{B,N) = Y,9o{h,bi+i), (4.1) 

i=l 

where B denotes the whole sequence and bi = A, T, C, G is the i^^ base. Typical values of the 



binding energies are given in table 4.1 



What we are interested in is the phenomenon of unzipping under a force, the denaturation of 
dsDNA when the two strands that compose its double helix are pulled. 

Let us now suppose for a moment we know the free energy of ssDNA under tension and that 
this is a linear function of the number of basis and otherwise depends only on the tension / 
applied to it. At equilibrium we will have that n bases of ssDNA have free energy equal to 
ngssif)- We will focus on the form of 5ss(/) in the following sections, it suffices to say that it 
needs to be an increasing function of force. 

If we model only the motion of the opening fork and we do not include in the model the 



experimental setup (see figure (4.4): stretching the two strands of DNA away from one another 
we are able to apply a force and eventually open a base pair. When will this happen? The 
energy gain from the two new ssDNA bases must be greater than what is lost from the dsDNA 
energy, that is: 

AG{i)=go{bi,bi+i)-2gM), (4-2) 
must be negative for the process to be energetically favored. 

It is important now to put some numeric values on the quantities involved: the free energies 
go and gss are both of the order of a few k^T, forces are expressed in units of pN and distances 
in units of nm. k^T ~ 4 pN nm. 

The typical range of an hydrogen bond is about 0.1 nm, since the critical force needed to break 
it is of about 15 pN, this works out to an energy of about 0.4 k^T which can be neglected 
with respect to the few k^T of the binding energy which is known thermodynamically. This 
means that the opening rate will be independent of force. 

Detailed balance then gives us the closing rate, which depends only on the force fluctuation 
needed to bring the two strands close enough to form the hydrogen bond. We then have: 

ro(n) = re^^Boin) ^ ^^(j) ^ ^^^Pg^sif) . ^4 3) 

where /3 is the inverse temperature and r gives the timescale of the phenomenon. We will 
refer to r as the attempt rate; it can be estimated from the rate of self-diffusion for an object 
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G(x) 




-fx 



Figure 4.2: The switching rate between the two states is proportional to exp(/3G*), where G* 
is the free energy barrier. The apphcation of a force / tilts the distribution and lowers the 
barrier of AG* ~ —fx*. Actual numerical values indicate this can be neglected with respect 
to G*. 



the size of a ssDNA base: = l32iTrjl'^ = 0.17 ps, where I = 5 nm is the size of a base and 
r] is the viscosity of water. 

The interplay of the the stacking and pairing energy qq and the energy gained from the 
two newly formed ssDNA bases 2gss is responsible for the formation of a complex energy 
landscape full of metastable minima, not dissimilar, to a one-dimensional random walk in a 
random environment, also known as the Sinai model. This suggest the use of methods from 
the statistical mechanics of disordered systems and from information theory for the description 
and analysis of such a system. 

In figure |4.3| we show as an example the free energy derived from the first 50 base-pairs of the 
A-phage DNA at two different forces. 

Cocco and collaborators first worked out the opening and closing rates in |Cocco OTllCocco 03| . 
as a Eyring-Kramers transition state theory [Eyring 35[ IKramers 40j . This theory describe 
the transition with a suitable continuous variable (which here is the separation x between the 
two bases forming the base pair), x obeys Langevin dynamics over an effective potential that 
is the free energy G{x). This potential has two local minima at the two equilibrium position 
that correspond to broken/whole hydrogen bonds (see figure 4.2). 
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G(n) 




Figure 4.3: Free energy G (units of AjbT) to open the first n base-pairs, for 200 randomly 
selected bases. 



4.2 ssDNA as a modified freely jointed chain 

One of the simplest polymer models possible is that of the freely jointed chain. The FJC is 
composed of monomers of length d, no constraint is put on the angles formed by consecutive 
segments and the excluded volume is not taken into account. 
The end-to-end distance is thus given by: 

N 

R = Y,n, (4.4) 

i 

where the fj are random vectors of length d. This length is often referred to as Kuhn length. 
If we are in the thermodynamic limit we can use the central limit theorem to show that the 
average end-to-end distance {R) vanishes and that it is distributed according to a normal 
distribution of variance {B?) = Nd^. 

To get this result [Kuhn 42l IJames 4^ we have assumed that no force was acting on one end 
of the chain, and it is, in fact, only valid around for end-to-end lengths of the order of y/Nd. 
In order to get the right result for high extensions we have to add a tensile force / applied in 
the X direction. We can now compute the average value of the x component of the ith link of 
the polymer as: 

f^^ Xiexp{f3fxi)dxi 
I-d exp{Pfxi)dxi 

Where C{x) = coth(x) — 1/x is the Langevin function. The total length of the polymer along 
the X axis is then given by Lfjc(/) = Nlpjcif)- The interested reader can find further details 
in a classical reference such as [Flory 53| . 

At the beginning of the nineties it became possible to measure the elasticity of DNA with 
magnetic beads [S mith 92] . It then became apparent that, up to forces of 20 pN/nm the 
elasticity of ssDNA is well fitted by a FJC model, but even better results are obtained using a 
modified FJC where the monomers are extensible at high forces and where the contour length 
{i. e. the total stretched length of the polymer) is not given by the product of the number of 
momomers and the Kuhn length: 

;MPac(/) = toe (l + = <i ("th(W^ -^){^ + £) (l-f) 
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^Fjc(/) = {x^) = ^^f^-^ = dCiPfd) , (4.5) 
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where d = 0.56 nm, b = 1.4 nm and = 800 pN [Smith 96] . 

4.3 dsDNA as an exstensible worm-like chain 

The worm-hke chain (WLC) is one of the simplest continuous models of a polymer: if we 
define a parametric curve in space r(s) we can define it's tangent vector as t = and its 

curvature vector as w = we can further impose that the polymer is inextensible, that is: 
\vect{s)\ = 1. 

Then we can give the internal energy for a polymer stretched by an external force / as: 

f3E= / ds-\wis)\^-/3fi{s)-x, (4.7) 
Jo ^ 

where A is the persistence length, that turns out to be the correlation length of the direction 
of the polymer at zero force. 

The WLC is analytically solvable model, however the solution can only be written as an 
infinite series ??. Luckily a very precise numerical fit has been proposed by Marko and Siggia 
in [Marko 941 IMarko 951 : 

koX. 4(1 - l-WLc/ltotY 4 

where /tot is the contour length of the polymer divided by the number of bases, A is the 
persistence length and /wLC is the length of the polymer in the direction of the force /. 
In the following years even more refined fits to the experimental data have been proposed 
such as the one by Moroz and Nelson [Moroz 97] which used a formula first proposed by 
Odijk [Odijk 95] . Their formula can fit the experimental data for the elasticity of dsDNA 
for a very large range of forces [Bouchiat 99| . thanks to the relaxation of the hypothesis that 
\t{s)\ = 1, which plays an important role at high forces and the inclusion of torsional effects. 
However we do not need such a large range of forces for the description of unzipping exper- 
iments; because of this that in the following we will use a simplified version of the Odijk 
formula, namely: 



Wlc(/) — ^tot 

where Ztot = 0.34 nm A = 48 nm and 7ds = 1000 pN 



1 - - (/3M)-^/^ + 
2 7ds 



(4.9) 



4.4 Two possible ensembles 

The description we have given above does not depend much on the experimental setup, the 
only time where we have lost some generality is in the description of fork dynamics, where we 
have assumed the force to be fixed; however both the polymer description and our choice for 
the dynamics are completely independent of details like this. 



In the following we will outline two possible experimental setups (pictured in figure 4.4): in 
the first force is a parameter and the extension of the polymer, which is directly related to 
the number of open bases, is measured; in the second the distance between two optical traps 
can be varied as a parameter and the displacement of the beads in the traps can be measured 
to give a precise measurement of force. 
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The only detail that needs to be sorted out is the change in variable in the thermodynamic 
potentials that describe different setups, but this can be easily done through a Legendre 
transform. Before we go on we should lay out the notation we will use in the following: first 
of all capital letters denote extensive quantities, while lower case letters correspond to the 
equivalent intensive quantity, x is the end to end distance of a polymer and / = x/n, where 
n is the number of momomers (bases here). For example W{x) = nw{l) 
Let's lay out all the quantities: 

• g{f) is the free energy per base as a function of force. 

• /(/) = '^^qP is the length as a function of force. 

• w{l) = maxf[fl — g{f)] the free energy as a function of length. 

• /(/) = the force as a function of length or the inverse of 1(f)- 

• k{l) = ^-ijp- is the effective spring constant for a given length. 

• -TT-pr = ^^|tt^ is the reciprocal effective spring constant as a function of force. 



4.4.1 Fixed force, magnetic tweezers 

At the beginning of the 2000s Gosse and Croquette |Gosse 02] developed a technique called 
optical tweezing: a superparamagnetic bead with a diameter of the scale of the pm is placed 
under the two poles of a permanent magnet, which creates a magnetic gradient. 
The distance between the poles of the magnet (less than 1 mm) is fixed so that on the scale of 
the typical movements of the bead the gradient of the magnetic field is almost constant and 
so is the force applied to the bead. 

Magnetic beads have a preferred direction. This is at the same time an advantage and a 
disadvantage: the advantage is the possibility of applying a torque to the bead, which has 
opened the door to experiments involving the coiling and uncoiling of DNA; on the other 
hand the DNA will bind on a random point of the surface and it is impossible to say exactly 
where. Given the relative size of the bead and of a single base of DNA, this means that the 
unzipping experiment can start up to 1000 bp away in two different runs. 
The position of the bead can be recorded optically. This type of experiments are relatively 
easy to set up, and the modellization of fork dynamics at fixed force is perhaps more intuitive. 
On the other hand fixed force experiments tend to be ill suited for sequencing purposes, since 
it is difficult to control the position along the energy landscape where the fork will stop. 
For a given portion of the sequence there exists an average critical opening force. When the 
critical force is exerted the fork will fiuctuate around a given number of open bases for a long 
time because it is in a potential well. On the other hand the top of the potential barriers that 
separate these wells are very hard to sample, because very little time will be spent there. 
Another reason why this method is not very well suited for sequencing through unzipping is 
that the position of the fork for a given force depends strongly on the sequence and it is very 
hard to generate an unzipping protocol with varying force without a prior knowledge of the 
sequence. 
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Figure 4.4: Typical experimental setups that will be described in the following. A) A setup 
with two optical traps (beads xi and 2:4) drawn as springs and whose centers are the black 
vertical lines; B) a setup with a single magnetic bead X3 that applies a constant force on 
the molecule attached to a fixed "wall". In both cases the molecular construction is made 
by a DNA molecule that has to be opened (therefore one should include two single-strand 
linkers that arc the opened parts of the molecule) and one double-stranded DNA linker. The 
coordinates Xj are the distances of the corresponding points from the left reference position 
(which is the center of the left optical trap in case A and the fixed wall in case B). 
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Figure 4.5: Several typical fixed force unzipping traces from [Danilowicz 03| . Solid lines 
correspond to a force of 15 pN, while dashed lines correspond to a force of 20 pN. The 
measured quantity is the distance between the center of the magnetic trap and the surface 
of a glass micropipette which the DNA is attached to. Horizontal plateaux correspond to 
minima of the free energy. 
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4.4.2 Fixed distance, optical tweezers 

Pioneering studies on the effect of radiation pressure from laser light on micrometer-sized 
dielectric beads were performed at the beginning of the seventies by Ashkin [Ashkin 70] . A 
few years later the same Ashkin developed a single beam technique for trapping dielectric 
beads [Ashkin 86] . 

In optical tweezers a tightly focused laser beam passes through a dielectric sphere which has 
an optical index higher than that of the surrounding fluid. The incoming light from the laser 
is refracted by the bead causing a change in the momentum of the outgoing light; because of 
the conservation of momentum, the bead will experience a change of momentum of opposite 
sign. 

For high enough numerical aperture of the laser there exists a stable position of the bead 
along the axis of propagation of laser light, on the other hand stability along the transversal 
directions is due to the intensity profile of the laser, which is most of the times Gaussian. 
In order to give a precise description of the phenomenon for the conditions most often used 
in micromanipulation experiments, we should take into account the full Mie theory of light 
scattering, since the bead size (1 pm) is very close to the wavelength of the laser employed 
(see for example [Mangeol 08| , where the laser wavelength is 1.064 pm). 
On the other hand we can give an hand- waving argument for the stability of the trap using ray 
optics: a particle with a refractive index higher than water will act as a positive lens, roughly 
speaking if the bead is placed before (after) the focal point the rays will diverge (converge) . 
If the lens converges the ray the light will have more momentum in the direction of propagation 
of the beam, conversely, if the beams have been diverged the light will lose momentum. 



See figure 4.6 for a schematic picture. The interested reader should refer to Kerker's book 
[Kerker 69] for a full treatment of the Rayleigh and Mie regimes. 

The use of optical traps for the manipulation of biopolymers is compelling because it allows 
to fix the position of the beads and to measure the force exerted on the molecule. This is very 
attractive for unzipping experiment because it gives us a chance to focus on a specific region 
of the sequence, while in fixed force experiments the region of DNA where the fork will spend 
most of the time depends on the sequence itself. 

The measurement of force is obtained by the observation of the displacement of the bead 
with respect to the center of the bead. The optical trap is well approximated by an harmonic 
potential around its equilibrium position. The displacement of the bead can be measured 
either by direct observation of the diffraction pattern through a microscope, or by measuring 
the deflection of the laser beam with a PSD jWallmark 57] . 
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Figure 4.6: The effect of the refraction of hght on a dielectric bead in the ray optic approx- 
imation. The light propagates from bottom to top. Fi and F2 are the forces acting on the 
bead because of the concentration of momentum, Fnet their resultant. 
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Figure 4.7: A typical fixed distance unzipping trace from |Mossa lOj . The force, measured as 
the displacement of the bead in the optical trap, is measured as a function of time. Notice 
how the three minima of the free energy correspond to the three green lines where the bead 
spends most of its time. 
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4.5 Overdamped dynamics 

The motion of very small objects suspended in a liquid does not resemble much to that of 
objects in everyday life. The most striking features are the absence of inertial effects and 
Brownian noise. 

A common way to quantify the ratio between inertial and viscous effects is the Reynolds 
number Re, which was introduced by Stokes [StokeilST], several years before Reynolds popu- 
larized it. 
It is given by: 

Re=^, (4.10) 

where V is the mean velocity of the object with respect to the fluid, p is the density of the 
fluid, I is the linear size of the object and r] is the viscosity of the fluid. 

It appears in the dimensionless Navier-Stokes^ equation for an object immersed in a Newtonian 
fluid as: 



Rej^— +v-Vvj =-Vp + VV + f (4.11) 

where v is the speed of the object divided by F, p is the pressure of the fluid divided by VjV/l^ 
f are the external forces per unit volume divided by "qV/l"^, V stands for the the space partial 
derivatives vector multiplied by / and finally d/dt is the time derivative multiplied by l/V. 
The importance of the Reynolds number is that it is the only quantity needed to describe the 
flow of a fluid, that is to say that once the variables have been properly rescaled sistems of 
different size, viscosity and density will behave the same way. 

It is customary to categorize the characteristics of the flow according to the Reynolds number: 

• Re ^ 1: Turbulent flow. Inertial forces are dominant. 
E.g. man swimming, the wing of a plane. 

• Re ~ 1: Laminar flow. Viscous forces dominate. E.g. water in a pipe. 
E.g. blood flow, flsh swimming, man swimming in glycerol. 

• Re <C 1: Creeping flow. Inertial forces are completely negligible. 

E.g. Bacteria in water, pm-sized beads in optical traps, macromolecules in solution. 

Among the objects that we will consider in the following those who have the largest Reynolds 
number are the beads in the optical traps; for them Re ~ 10~^, because of this, the remarks 
we will make on their dynamic behavior will be all the more valid for objects with lower 
Reynolds number. 

Let us suppose that a bead of diameter d, is suspended in water by an optical trap of stiffness 
k. Let us also suppose for the moment that the bead has the same density as the water 
surrounding it. 

The bead obeys the Langevin harmonic oscillator equation: 

mx + 'yx + kx = £,{t) , (4.12) 



^Please note that this is not the only way to rescale the variables in order to make the adimensional: pV^ 
has the dimensions of a pressure and can be used to the same effect, it turns out this latter is the right scaling 
for high Reynolds numbers, while the one in the main text is the right one for the limit of low Re. 
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Figure 4.8: The amplitude response as a function of frequency: the blue curve corresponds 



to eq. (4.14 while the violet one corresponds to eq. (4.15). For this plot we have chosen 
^ = 100, this way the cutoff frequency is well below the frequency where the mass effects 
become dominant. 



where m = l/Girpd^, 7 = /SGirrjd and ^(t) is Gaussian noise obeying 

m) = 0, mm) = 2^k^T5{t - 1') . (4.13) 

We now consider the frequency response by performing a Fourier transform obtaining 



muj 



The question is: when can this be approximated by its Brownian counterpart, neglecting the 
mass term? 

The new equation for the frequency response would read: 



1 



(4.15) 



Now this response has a cutoff frequency of^/k, what we want, in order for our approximation 
to be valid, is for this frequency to be much smaller than the one at which mass effects become 
important, that is m/7. Summing up we want 

mk Ipk 



> 1. 



(4.16) 



7^ (67r)'^ry2 

Plugging in realistic values for the stiffness of the trap /c = 0.5 pN/nm, for the density p = 1 
g/cm^, the diameter of the bead / = 1 pm and the viscosity of water 8.9 10~^ Pa s; we find 
the ratio to be very small: ^ = 1.9 10"! This is in accordance with what we would have 
expected by using the Reynolds number, in fact pm-sized beads are well within the creeping 
flow range for speeds up to 10 cm/s. 

In the following discussion we will be well justified in leaving out the mass terms from our 
equations and considering all of the dynamics as Brownian or overdamped. 
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4.6 Coupling all the dynamics together 



In this section we will derive an effective mesoscopic dynamical equation for coupled het- 
eropolymers. The details of the calculation are somewhat technical, but they offer a different 
insight from the derivation published in the appendix of [Barbieri 09| . 

Because of the preceding discussion on overdamped dynamics we will ignore all inertial effects. 
In addition to this simplification we will consider the simplest polymer model: a chain of sim- 
ple Hookean springs, also known as the Rouse model [Rouse Jr 53j . We will also consider the 
model to be effectively one dimensional. 

What we hope to understand better here is how movement propagates along a heteropolymer, 
what kind of fluctuations and correlations are important and how. It is also of interest to 
know whether the polymer can be considered at equilibrium and what are the relaxations 
times. 

We will show that in a mesoscopic description where we do not describe single monomers 
the noise is not decorrelated and we will propose a way to implement these characteristics in 
computer experiments. 



4.6.1 Scaling of a homogeneous Rouse polymer 

Let us now derive the equations for the simplest case: that of a homogeneous polymer. At 
flrst we will derive the equation for the free end of the polymer and then we will concentrate 
on a midpoint to see how the dynamics are coupled, we will see of this leads to a viscous drag 
matrix on the left hand side and how this translates into fluctuation dissipation relations for 
correlated noise. 

Each monomer is characterized by its spring constant k and its viscous drag coefficient 7. Let 
us suppose that a chain of identical springs is connected to a non moving wall on one end 



and that a constant force / is exerted on the other end. The setup is shown in figure 4.9 
The monomers will then obey this system of simultaneous equations: 




Figure 4.9: A homogeneous Rouse polymer composed of identical springs and beads each 
having spring constant k and viscous drag coefficient 7. The coordinates Xi are taken along 
the direction of the pulling force /. 
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7x1 



-2kxi + kx2 + rji 



2kXn + kXji—l + kXn-\-l + ?7ri 



(4.17) 



1XN 



-kxN + A:xAr_i + f + riN , 



where Xn is the coordinate of the n^^ hnk. r/„ are uncorrelated Gaussian noises of zero average 
and autocorrelation function: 



{Viit)rjjiO)) = 2^kBTS^j6it) . 



Equation (4.17) can be solved formally for xi in terms of integrals of X2- Thus: 



xi[t) = - e T ^ ' 
7 Jo 



[kx2{t') + m{t')] 



(4.18) 



(4.19) 



This doesn't bring us any closer to solving the system of equations, in fact iterating this 
procedure will only produce an integro-differential equation of order A^. To make the problem 
tractable we have to solve it in the limit in which the ratio 7//C is small, which is reasonable 



given that for ssDNA at typical conditions it has the value of approximately 10 s, many 
orders of magnitude below experimental resolution. 



Equation (4.19) thus becomes: 



Mt) = \ [X2{t) + 771 (t)] - ^Xl(t) + O (I 



Substitution of this into the equation for X2 yields: 



^-px2 = -{k + ^k)x2 + kxs + ri2 + 



(4.20) 



(4.21) 



The fluctuation-dissipation theorem for Brownian dynamics of the form jx = —'W{x) + rj 
states that, in order for Boltzmann equilibrium to be attained the following relation must be 
verified: 

ivitMO)) = 2^kBT6{t) . (4.22) 



For equation (4.21) this translates to: 



2--fkBT6{t) . 



(4.23) 



This means that our approximation is consistent with the fluctuation-dissipation theorem. It 
is obvious that the iteration of this procedure will define renormalised k and 7 and new rjn 
which will be correlated. We can write: 



Jn-lXn-l = -{k + kn-l)Xn-l + kXn + Vn-l 

Then solve this equation with the usual approximation finding: 



Xn—l 



In-lk 



k -\- kn—l 



I Xfi -\- 



1 



(4.24) 



(4.25) 
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which must be inserted in the equation for Xn- 



7 + 



{k + kn-lY 



+ rin + 



k 

Vn-l 



(4.26) 



k ~l~ kfi—i 

thus defining recurrence relations for the coefficients: 

kkfi—i 



kry 



7n = 7 + 



k ~\~ kn~i 

In-lk"^ 



{k + kn-if 



iv'nitWnm = {Vnit)Vnm + 



{k + kn-l)' 



:{Vn-litWn-im 



(4.27) 
(4.28) 
(4.29) 



Applying the fluctuation dissipation theorem to the last equation shows that we have chosen 

the only approximation consistent with the preceding equation. That is to say that the 7's 

on the left hand side obey the same recurrence relations as the Brownian noises. 

As we have already calculated the values of the constants for n = 2 we can easily solve the 

recurrences: 



kn 



7n 



1 



n 



(2n+l)(n + l) 
6n 



This way we can rewrite equations (4.26) and (4.25) as: 
(2n + l)(n + l) 



6n 



n — 1 

Xfi—l — X 



k + - ] Xn + kXn-\-l + Tin ) 

n J 

(2n- l)(n- 1)7 



n — 1 

K nk 



n 6n 

The recurrence can be completely closed with the help of the equation for xn as: 



(2iV + l)(iV + l)^ 
6iV 



-IX N 



k_ 

'n 



XN + f + v'n- 



(4.30) 

(4.31) 
(4.32) 



(4.33) 
(4.34) 

(4.35) 



Not surprisingly we recover the scalings of the Rouse model when it is solved in the continuous 
n limit (see for example |Doi 86] ) in that it gives: 



-IXN 



k_ 

'n 



XN + f + 'n'N 



(4.36) 



What we would like to explore now is what happens to a subpolymer, i. e. write down the 
evolution of one end of the polymer and of a midpoint, integrating out all other degrees of 
freedom. To do so we need to start from the (A^ — l)*'^ link of the polymer. 



'~iXN-l = -2kxN-i + kXN-2 + kXN + VN-l ■ 
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which gives, after the usual procedure: 



1 ^ Tjj^—i /"y 

XN-l = 2 [xN-2 + ^ [xN-2 + Xn] + ° 



(4.38) 



which can now be used in the {N — 2) and equations yielding: 



5 1 

-7X7v_2 + --yxN 

5 1 

-JXN + ^7XN-2 



+ XN~2 + ■^XN-2 + kxN-3 + VN^2 + ^^^^ (4-39) 

-\xN + ^XN-2 + f + m + '^- (4.40) 



If we define: 



^nXN-n + InXN = -{k + kn)xN-n + Kxn + kxN~n-l + VN- 
-/^XN + %XN-n = -knXN + knXN-n + / + , 



(4.41) 

(4.42) 



we can solve the first to get recurrence equations: 



kn k j ^nk-n 

XN-n = ^Xn + ^XM^n-1 



^nk 
{k + knY 



XN-n-l + 



+ 



(/c + fcri)^ k + kfi 



XN 



VN-n , (1 

k + kn ^k 



(4.43) 



and deriving: 



XN- 



kn 



■^XN + 



^XN-n-l + 0[j 

k + kn ^k 



{AAA) 



k ~\~ kji 

These last two expressions need to be used in the equation for the (A^ — n — 1)*^ link and in 



equation (4.42) to define the recurrence relations: 



T^^A^V , ( Inknk , Ink 
7 + ^TTT I XN-n-l + \ T, ^TTT + : — I ^^A^ 



{k + knY 



[k -\- kfi)'^ k + kn 



kk'fi \ kkfi ( k 

+ 7 iT I ^N-n-\ + 7 T^XN + kXN-n-2 + ( W-n-1 + T f^VN-n ] ] 

K + Kn I k -\- kn \ k -\- kn 

' b , 27„A:„ 7n^n V , ( Inknk Ink \ . 
In + -, — + TTTT \XN + \ TTTT + : — I XN-n-l 



(4.45) 



k "h kfi {k -\- kfi^^ 



{k "h kfi)'^ k -\- kfi 



kkfi kkn , n , I , kn 

f^XN + ^XN-n-l + / + \Vn + 7 f^VN-n 



(4.46) 
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and then: 



k -\- krfi 



a 1.2 



In+l = 7 + 



{k + knf 



7n+l 



I'nknk ^ ^nk 



ik -\- kfi)'^ k -\- k^i 



h h . '^Inkn llikn 
In+l = 7n + — + 



k + kn ik + kn)"^ 

A;2 



+ 



(^h -\- k^] 



(7y^_„_l(t)4"+^)(0)) 



;{fiN-n{t)riN~n{0)) ] 



;{fjM-nit)VN-niO)) 



+ 



(k + kn)^ 



+ 



(A^ -|- kf^ 



k ~\~ kf] 



;{VN-n{t)fiN^n{0)) 



(4.47) 
(4.48) 
(4.49) 
(4.50) 

(4.51) 
(4.52) 



(4.53) 



Which are quickly solved as: 



In 



7n 
it 



n 

(2n+l)(n + l) 
6n 

(n + l)(n - 1) 
6n 

(2n + l)(n + l) 
6n 



-7i 

7; 

7' 



(4.54) 
(4.55) 
(4.56) 
(4.57) 



This enables us to rewrite equations (4.41) and (4.42) as: 



(2n+l)(n+l) (n+l)(n-l) /, k^ 

JXN-n + ^ IXN = - (k + - ] XN- 

bn bn \ n , 

k 



+ -XN + kxN~n-l + VN- 

n 

(2n+l)(n+l) (n+l)(n-l) 

— IXN H ^XN-n = 



6n 



bn 

+ -XN-n + f + Vn^ ■ 

n 



k 

~XN 

n 



(4.58) 



(4.59) 
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Substitution of equation (4.34) in equation ( |4.58 ) yields: 

1) . 



2Nn{N-n)+N {n + l){n 
-IXN-n H 



Qn{N — n) 



k 



+ —XN + 

n 



Qn 
N -n 



- 1 



k k' 
— + - \ XN- 

1\ — n n . 



(4.60) 



-??Ar_„_l + VN-n 



N -n 

This defines a system of two coupled equations for xj^ and x^-n which cannot in general be 



decoupled because the coefficient matrices of 
to one another. 

( 2Nn{N-n)+N (n+l)(n-l) 
en{N-n) 6n 
(ft+l)(n-l) (2n+l)(n+l) 
6ra 6n 



XN-r 



7 



and 



XN-r 
XN 



1 

N-n 



1 



n 



XN-n 
XN 



+ 



N-n-1 ^/ _|_ ~ 

N-n 'IN~n-l 'iN-n- 



where r]N- 

These equations can be rewritten in the large N limit as 



1 a 

3 6 

a a 

6 3 



1—a a 



XN{l-a) 
XN 

k I X 
N 



N(l-a) 
XN 



+ 



where we have defined a 



XN-r 
XN 



are not proportional 



+ 



VN-n 
An) 

Vn 



+ 



llNil-a) 
JaN) 

In 



(4.61) 



(4.62) 



N- 



4.6.2 Scaling of a non-homogeneous Rouse Polymer 

The efTect of a single intermediate dishomogeneity Let us now suppose that one of 
the links that compose our polymer has a much greater viscosity than its neighbours, which 
we leave homogeneous. We wish to investigate this kind of setup because it will give us some 
insight on how the attached DNA hairpin affects the fluctuations of the linkers and whether 
or not it decorrelates them. 



What we are planning to do is to write two coupled equations as in equation (4.61), namely 
for the A^**^ and the (A'' — n)**^ links when the {N — n + l)**^ has a much greater viscosity 



than the others. In what follows we will indicate with F as opposed to 7 the viscosity of the 



different link. The setup is shown in figure 4.10 



Looking at equations ( 4.47|4?50 ) it is immediately apparent that the only one which involves 
the viscosity of an intermediate link is the one for 7", namely equation (4.48). Retracing the 
steps that brought us to equations ( |4.61| ), we have to correct equation (4.55) as: 

(2n-l)(n-l) 



7n 



6n 



-7 + r. 



(4.63) 



This change propagates in equation (4.58) but not in equation (4.59), and in turn equation 



(4.60) becomes: 



^2n(A^ 

k 



3){N -n) + N 



6n{N 
k 



7 + r XN-n + 



(n + l)(n - 1) 



A^- 



+ 
n n 



- n) 

k N-n 

XN-n + -XN H 77 

n 1\ — n 



6n 



-IXN 



1 



(4.64) 



v'n- 



n-l 



+ VN- 
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Figure 4.10: A non-homogeneous Rouse polymer composed of N identical springs and — 1 
beads each having spring constant k and viscous drag coefficient 7. The n*^ bead is taken 
to have viscous drag coefficient F. The coordinates Xi are taken along the direction of the 
pulling force /. The node with higher viscosity represents the DNA hairpin to be opened in 
a typical experiment. 



we have to underline that the second noise term has also changed in order to fullfill fluctuation- 
dissipation relations. 



The correlation matrix of the noise terms in equation (4.61) becomes then: 



2n{N-3){N-n)+N ^ (n+l)(n-l) ^ 
' 6n ' 



6ri(Af-?i) 
(n+l)(n-l) 
6n I 



(2n+l)(«+l) „ 
6n ' 



(4.65) 



which can be rewritten in a clearer form in the limit of 



00 with = a as: 



f7 + r ^7 



(4.66) 



Block polymers Suppose we have a polymer composed of two sections: the first composed 
of rii links of viscosity 71 and elasticity ki, the second of n2 links of viscosity 72 and elasticity 
k2- In close resemblance with what we did before we ask ourselves how this modifies the 
equation for the effective evolution of the floating extremity and of the point where the two 
sections are linked. 

It is important to know this because most DNA unzipping experiments so far have relied on 
linkers of both single- and double-stranded DNA bonded in heteropolymers of various lengths. 



Equation (4.34) involves only links of the first tipe and can thus be easily rewritten with 



the additional index. We may think that equations (4.58) and (4.59) share the same fate but 



a different index; unfortunately this is true only of the elasticities. The viscosity of the link 
that connects the first and the second section (that is the n^i ) is of the first type. Equations 
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Figure 4.11: A non-homogeneous Rouse polymer composed of ni identical springs and beads 
each having spring constant ki and viscous drag coefficient 71 and n2 more identical springs 
and beads each having spring constant k2 and viscous drag coefficient 72. The coordinates Xi 
are taken along the direction of the pulling force /. One can imagine momoers of type 1 to 
be ssDNA and those of type 2 to be dsDNA. 



(4.58) and ( |4.59 ) then become: 

/(2n2-l)(n2-l) , \. , (n2 + l)(n2-l) . 

, ^2 \ ^2 

^2 H I Xjii H Xni+n2 + "'l^ni— 1 + ^rti 

(2n2 + l)(n2 + l) . , (n2 + l)(n2-l) . 

72Xni+n2 H ^ 72Xni = 

677-2 6n2 

k2 k2 ,(n2) 



(4.67) 



(4.68) 



-X 



J- J- f J- ~V"2 



n2 n2 

Putting all back together gives us two matrices: one for the 7's and the other for the fe's: 



(2ni+l)(ni+l) ^,^ _L (2n2-l)(n2-l) (n2+l)(«2-l) 

^2 (2n2 + l)fn2+i: 

bn2 bn2 



6ni 71 + 6712 ">'2 6?i2 ) /-^ cq\ 

(n2+l)(n2-l)„. (2n2 + l)(n2+l)„. ' l^^.uy; 



(fc]^ _|_ fc2 _fe2 \ 
rii ' n2 n2 \ 
_k2 k2 I ' 

"2 712 / 

the former can be rewritten in the limit of ni, n2 large as: 

^7i + f72 ?72 

f72 f72 



(4.70) 



(4.71) 



Validity of the approximation In the beginning of this section we have stated that the 
microscopic time-scale for ssDNA is given by 7/A: = 10~10 s. Now by looking at the scaling 
of the mesoscopic timescale we'll obtain the range of validity of our aproximation, that is the 
timescale at which a polymer will continue to behave as a single entity and the propagation 
time along the polymer will be negligible. The scaling of the macroscopic time is proportional 
to 'y/kN'^/3 where N is the number of monomers. 

Given that as of today the state of the art in experiments the maximum sampling frequency 
is of the order of 10 kHz, polymers larger than 1000 base pairs have relaxation times that are 
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observable. 

To take this into account in mesoscopic simulations we can split long polymer into smaller 
pieces even though this doesn't appear to have an appreciable effect on measured relaxation 
times. The interested reader should refer to section 5.1 of jBarbieri 09| . 

Moreover in [Barbieri 091 Appendix A], a more formal discussion of the normal modes of a 
single homogeneous polymer is given. It turns out that the factor 1/3 that appears in our 
equations is an approximation of the true factor 7r^/4 that would appear in an exact treat- 
ment. Here we have preferred to give this approximated result is the only one that can yield 
the scaling for the off-diagonal terms, and extends well to non-homomgeneous polymers. 



4.6.3 Detailed balance 



Now that we have described all the different pieces of the dynamics of DNA unzipping we 
would like to derive a coupled mesoscopic dynamics that respects detailed balance equations 
with the right thermodynamic equilibrium distribution: 



P(n, x) 



-l3W{x,n)-(3Go{n;B) 



/z, 



(4.72) 



where Go{n; B) = (jfo(6i, 6i+i) is the binding energy of the fork, and W{x,n) is the free 
energy of the linkers, the beads and the traps, but we need not concentrate on the details for 
now. 

This is not a trivial task because we have to take into account the coupling between a contin- 
uous time Markov chain (the fork dynamics n), and the Brownian dynamics of the polymers 
and the beads. 

Let us first identify the possible events at each time step, the fork can either open, close or 
stay where it is at each time step, and the x variable will perform a Langevin step of size Ax. 
We have identified three transitions that correspond to three detailed balance equations: 



P{n, x)Ho{x, n, Ax) 
P{n, x)Hc{x, n, Ax) 
P{n, x)Hs{x, n, Ax) 



P{n + l,x + Ax)Hc{n + l,x + Ax, -Ax) ; (4.73) 
P{n - 1, X + Ax)Ho{n -l,x + Ax, -Ax) ; (4.74) 
P{n,x + Ax)Hs{n,x + Ax,-Ax); (4.75) 



where o, c and s denote respectively open, close and stay, and H are the transition rates. 
If we now suppose, as we have discussed previously, that the opening rate depends exclusively 
on the binding energy, and we further impose it to be a product of the opening rate and a 
Langevin step we get: 



H,{x,n,Ax) = ^Ate^^("'^)-^«("+i'^) 



UnTAt 



In 



exp 



In 



ATAt 



Ax 



f{x,n)At\ 



In 



(4.76) 



that is consistent with the definition of Tq defined previously if we integrate over Ax. 
This, in conjunction with equation (4.74) gives the closing rate: 

R^{x,n,Ax) = ^Ate^^("''^)-''^("+^^'"-^) 



/47rrAt 

7n-l 



exp 



AAt 



Ax + 



f{x + Ax,n- I) At 

7n-l 



(4.77) 
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The problem is that this rate depends on quantities computed both in x and x + Ax and it is 

not Gaussian for general /. On the other hand if we impose /(x, 71) — ^dx ^-'^^ perform 
a Taylor of the terms that are calculated in x + Ax expansion at the exponent we get for the 
W part: 



W{x, n) — W{x + Ax, n — 1) = 

W{x, n) - W{x, n - 1) + W{x, n - 1) - W{x + Ax, n - 1) 



W{x, n) - W{x, n - 1) + /(x, n - l)Ax 
and for the Brownian step: 



7n~i ( ^ /(x + Ax,n- l )At^ 



d'^Wix,n-l) 
dx"^ 



(4.78) 



(Ax)2 + 0((Ax)=^); 



4Ai 

7n-l 

' 4At 

7n-l 



7n-l 



Ax 



/(x + Ax,n - l)At 



7n-l 



/(x,n-l)AiV ^/ 
Ax-^'-^^ — j -/(x,n-l)Ax + 



/(x + Ax,n- l)Ax = 

d'^W{x,n-l] 



(4.79) 



AAt V 7n-l 
+0(AtAx) + 0((Ax)^) 



dx'^ 



(Ax) 



Now we only have to notice that terms up to and including order Ax cancel out and that for 
Brownian motion At ~ (Ax)^, to see we can rewrite the rate as: 



i^c(^,n. Ax) = rAte^^(^''^)-^^''("'"-i) 



lA-KTAt 

7n-l 



exp 



AAt 



Ax - 



/(x + Ax,n- l)At 

7n-l 



(4.80) 



which is now consistent with the definition of rc by integrating over Ax. 
The attentive reader should note that the force in the Brownian step is computed in n — 1, 
that is once the base has been closed, this has important consequences on the implementation 
of the algorithm. 

Finally, the rate at constant n is obtained by imposing that: 

J dAxHs{x, n, Ax) + Ho{x, n, Ax) + Hc{x, n, Ax) = 1 , (4.81) 



that is: 



Hs{x, n, Ax) = [1 — ro(x, n) — rc(x, n 

In 



UirTAt 
X \ / exp 

V In 



ATAt 



Ax 



/(x,n)At 



In 



(4.82) 



Now the algorithm can be summarized: 



p=randomr eal ( ) 

if (p<r_open (n , x) ) { 

x+=f (x , n) *dt /gamma+r andomgaus s i an ( ) * sqrt ( 2*bet a*dt *gamma) 

n++ 
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> 

else if (p<r_open(n,x)+r_close(n,x) ){ 
n — 

x+=f (x,n) *dt/gainma+randomgaussian() *sqrt (2*beta*dt*gamina) 

} 

else{ 

x+=f (x,n) *dt/gainma+randomgaussian() *sqrt (2*beta*dt*gaiiuiia) 

} 

Note how the order of the Brownian step and the opening or closing is reversed, as we have 
underhned before this is essential to the satisfaction of detailed balance equations. 

4.7 Results from the dynamical model 

We have spent the best part of the previous chapter defining an effective mesoscopic dynamical 
model for DNA micromanipulation experiments. Our approach is much more complex than 
separately simulating fork and polymer dynamics: first because it does not imply equilibrium 
and secondly because it allows for cross-correlation effects between fork, beads and polymers 
dynamics. 

In this section we wish to turn to the novel measurements that we have been able to perform 
thanks to this software and that were published in [Barbieri 09j . 

First of all we have observed that for complex polymers the expression W{1) = Nw{l) for 
the free energy breaks down at low A^. This was immediately clear when we observed that 
the measured sojourn times did not match the theoretical prediction from the Boltzmann 
distribution, however, the effect is much smaller even simply adding the nonlinear dependence 
in coming from the square root term in: 

^-m(x,n) ^ ^-l3Nw{xlN)^ PHx/N)f ^ ^^^^^^ 

where ^ is a dimensional constant of no importance, and k was defined previously as the 
second derivative of w with respect to / = x/N . 



In figure 4.12 we show the effect of the square-root term in the case of an uniform sequence: 
the time spent on a basis is obtained by simulation with and without the square-root term 
and by its Boltzmann estimate. 

Another set of quantities which is in general not available from first principles computations 
are correlation functions, in [Barbieri 09j we have studied in detail the dependence of the 
correlation functions on the number of open bases and on the length of the linkers in various 
experimental setups in order to determine the importance of out of equilibrium effects such 
as propagation times. 

In the following we will concentrate on a setup similar to that used in Bockelmann's lab at 
ESPCI: two optical traps of stiffness 0.1 pN/nm and 0.512 pN/nm respectively, a dsDNA 
linker of 3120 bases and a ssDNA linker of 40 bases. 

We have found that polymers which are shorter than 1000 bases show no appreciable effect 
due to finite relaxation times. For longer polymers we have devised a way of introducing the 
propagation effect: we cut up the polymer in pieces which are at most 1000 bases long and 
we simulate them separately. 



This is shown in figure 4.13, However we have found this to have an effect on the shape of 
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Figure 4.12: Average fraction of time spent on each base. The full (blue) curve corresponds 
to Eq. (4.83) while the dashed (black) curve corresponds Eq. (4.83) without the saddle-point 
corrections (the square-root term). The dot-dashed (red) line is Peq{n) oc exp[—nAg] with 
Ag = 0.006. n is the number of open bases. 
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the correlation function, but not on the correlation time. In fact, if the correlation function 

is fit with a stretched exponential of the form: exp[— (t/r)''], j3 is slightly smaller. 

We have also been able to study the dependence of the relaxation time for different parts of 




Ie-07 



le-OI 



Figure 4.13: Correlation functions for the setup in figure 4.4 A at two different values of the 



number of open bases, A^eq = 40 + n. ss and ds indicate the autocorrelation functions of the 
ssDNA and dsDNA linkers and xi and X4 are the autocorrelation functions of two optical 
traps of different stiffnesses (3:4 being the stiffest). 



the setup as a function of the number of open bases and to compare those with theoretical 



results. This is shown in table 4.2 and in figure 4.14 



The results are in very good numerical agreement except for the two beads: it turns out 
that the relationship between the bead and the number of open bases is more subtle than we 
thought. It appears that there is a very strong correlation between the fork and the bead and 
this effect is stronger when the optical trap is softer. 

This effect is desirable, it is in fact the effect that allows us to gain information on the sequence. 
To better quantify the relationship between the stiffness of the trap and the correlation of the 
bead with the fork we have defined the quantity: 



/(x4,n) 



E 

n 



dx^P^Xi, n) log 



P{xA)P{n 

as the mutual information between the fork and one of the two beads 



(4.84) 



In figure 5.19 we show the effect on the stiffness of the optical trap on the mutual information 
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le-02 




le+04 



4.13 



as a function of the 



Figure 4.14: Relaxation times of the correlation functions in figure 
number of open bases. In the case of the single strand (ss), only the fast relaxation time is 
plotted. For the fork and the single strand, dashed lines indicate a fit to t„ = A + BN^q (with 
^ = 1.3 • 10^3 and B = 8A- 10"^) and Teq = CN^^ (with C = 5.4 • 10-^^ s). For the others, 
full lines are guides to the eye. 





Theoretical (s) 


Numerical (s) 


Single strand 


4.83 • W-^^N'^q 


5.4 • 10-"Af^q 


Double strand 


4.96 • 10-5 


~ 3 • 10-5 


Spring xi 


1.67-10"^ 


~ 1.5 • IQ-^ 


Spring X4 


3.26 • 10"^ 


~ 7- 10-5 


Fork A^eq 


OC 14.2 + 0.013iVeq 


1.3- 10-^ + 8.4- 10-'' iVeq 



Table 4.2: Comparison between the correlation times of the setup in figure 4.4 A. as computed 
for an isolated element and the result of a complete numerical simulation. In the case of the 
fork, we reported as theoretical value 1/kes, that must be multiplied by a viscosity to obtain 
the relaxation time; it turns out that a viscosity ~ 8 • 10-^ pN s/nm matches the theoretical 
and numerical results. 



76 



4.7. RESULTS FROM THE DYNAMICAL MODEL 



CHAPTER 4. MODELING DNA UNZIPPING 



between the fork position and the bead position. We find that softer beads yield more infor- 
mation on the sequence. This can be intuitively understood by thinking that a softer trap 
gives way more easily to the excess length deriving from the opening of a base. 
It must be stressed, however, that this result holds only per measure, that is if one wanted 
to know if it were more efficient to have more rigid traps in an experiment one should take 
into account the autocorrelation times of the bead position. Those are in fact lower for stiffer 
traps allowing for a larger number of statistically independent measures per unit time. 



T 




0.5 1 1.5 2 

k (pN/nm) 

Figure 4.15: Mutual information I between X4 and n as a function of the trap stiffness, k. 
Black circles are computed on an uniform sequence, while red squares are measured on a 
sawtooth potential derived from a sequence that alternates stretches of 10 weak bases and 
stretches of 10 strong bases. 
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Chapter 5 

Inferring the DNA sequence 



As we have shown in the previous section, DNA unzipping experiments show a remarkable 
dependence on sequence in the force-extension signal. Several attempts have been made 
to reconstruct the free energy landscape from different experimental setups [Danilowicz 031 
IWoodside 06a[|Huguet 091 - 

In this section we will concentrate on algorithmic and mathematical approaches to solving 
the inverse problem, that is characterizing the free energy landscape as a function of n and 
eventually sequencing DNA. 

Idealized cases, where the number of open bases n is known at all time, are relatively easy to 
solve, but once we start adding the layers of complexity of real experiments, it becomes really 
difficult to extract information about the sequence. 

The first algorithm we will describe is based on the very idealized situation we have just 

described: infinite sampling frequency and knowledge of the number of open bases. 

The second supposes we can access the equilibrium value of physical quantities like the position 

of the beads with arbitrary precision, ignoring fluctuations. This is much more realistic than 

the first approach, but results are much farther from reconstructing the sequence. 

Lastly we will take a look at a toy model based on the Ornstein-Uhlenbeck model, that takes 

into account correlations and finite sampling frequencies. 



5.1 Infinite bandwidtii algorithm 

In what follows we suppose we have access to the results of a fixed-force experiment where 
the position of the fork is known at all times. Since this is not a realistic situation, the data 
on which to perform the inference must be simulated. 

Let us suppose we have perfect knowledge at all times of the number of open bases. It is clear 
that this is not realistic at all: first of all because the number of open bases n is not directly 
measurable and secondly because in order to obtain bandwidths that are large compared to 
the elementary event time-scale we would need a resolution of the order of the MHz or more 
and current experimental setups allow for resolutions three orders of magnitude smaller. 
The details of what follows were first published in jBaldazzi 061 IBaldazzi 07| . 
In the previous chapter we have defined the opening and closing rate: respectively ro(n) and 
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rdf)- For small enough time intervals At we can write: 



P{n{t + At)\n{t)) 



' Atro{n{t)) forn(t + At) = n(t) + 1 ; 

Atrc(/) forn(t + At) = n(t) - 1 ; 

I - Atro{n{t)) - Atr^if) ioi n{t + At) = n{t) ; ^ ' 

o{At) otherwise. 



This defines completely the transition probabilities from one state to another, and it can be 
used to define the probability of a the outcome of an experiment, that is of a complete trace. 
In order to do so we must define the relevant variables: 

• tn the total time spent with n open bases; 

• Un the number of transitions from n to n + 1; 

• dn the number of transitions from n to n — 1. 

Given those definitions one can write the probability of an experimental trace as 7", condi- 
tioned on the sequence B and on the external force /, as: 

P{J\B) = n(Atro(n(t)))«"(Atre(/))'^"(l - Atro(n(t)) - Atre(/))*"/^* 

(5.2) 

= C{T)]jM{bn,bn+i;Un,tn) ■ 

n 

where we have separated the part that depends on the sequence from that who does not, thus 
defining: 

C{T) = (At)"+'^ exp(-ttotrc(/)); (5.3) 
M{hn,hn+i;un,tn) = exp {gQ{bn,bn+i)un - re^'o(^"'^"+i)t„) ; (5.4) 

(5.5) 

where we have used the definition of Tq and we have defined u = X]„tin, d = X^n*^"' ^^"^ 

^tot — 

Now we can use Bayes' theorem to compute the probability of a sequence given a trace: 

We can further assume (though it is not generally true) that all sequences are equiprobable 
that is P{B) is uniform, this will lead us to a first rough estimate of the sequence given a 
trace. 

We can maximize the expression we have given for P{T\B) over the go{bn,bn+i) without 
imposing that it can only take ten values to get a maximum likelihood estimate: 

goibn,bn+i) = log(j^^ , (5.7) 

This computation is not bad as a first estimate, but it amounts to searching in a continuous 
space when we effectively have only 4 possible values for a base. In order to find the most 
likely sequence B* we can use the Viterbi algorithm [Viterbi 671 [MacKay 05 . 
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The procedure is as follows: let us consider the first two bases and let us define -P2(^2) 



maxfc^ M{bi, 62; ui, ti), then 6™ 



2) = argmax;,j M{bi, 62; ni, ti), and for n 7^ 1 we can write: 

(5.8) 
(5.9) 



Pn+l{bn+l) = m.a.-K M{bn,bn+i;Un,tn)Pn{bn) ] 

bn 



argmax M{bn,bn+i;Un,tn)Pn{bn) ; 

bn 



this means that the optimal value for a base depends on the choice for the next base. 

We can solve these equations up to the last Pj\[{bj\[) which is maximized to obtain 6^ — 

argmaxfe^ P/v(^7v) and we can then propagate back to the first value setting 6* = b^^^{b'^_^_l). 



The algorithm is explained graphically in figure 5.1 



What is great about Viterbi algorithm is that its complexity grows linearly in N and one 




■A 

T 



Figure 5.1: We start by choosing 6™^^(&2) which amounts to choosing the best 61 for each 
choice of 62 and can be represented by an arrow going from 62 to bi and then we iterate the 
procedure until we get to b^ (here N = A). It is now possible to compute the optimum bjy, 
in this case A and propagate back to obtain the optimal sequence TTAA. 



needs to explore only a very small subset of the 4 possible sequences. This is a feature of 
message-passing algorithms in one dimension. 

Another interesting feature of this framework is that unzipping experiments can be repeated 
several times and the different traces can be combined just by computing the product of 
probabilities: 

N 

P{Ti,Ti,..., Tm\B) = \{P{nB), (5-10) 

i=l 

where 71 is the trace of the i*^ experiment of a series of M. 

Therefore we can combine different experiments to infer the sequence. In [Baldazzi 07| it has 
also been shown that the rate of error decreases exponentially with the number of measure- 
ments. 

As we have said at the beginning of this section, however, this algorithm relies on two unre- 
alistic assumptions: knowledge of the position of the fork, which is never attainable because 
we actually measure the position of the bead; and an infinite sampling frequency. In the 
following we will try to come over these two assumptions by building more complex inference 
algorithms. 
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5.2 Perfect averages algorithm 

In this section we will perform a few simplifying assumptions in order to keep the equations 
simple looking. The reader should note, however, that these simplifications are by no mean 
fundamental and our results will hold even after relaxing those assumptions. 
The first assumption is that we substitute Gaussian polymers for the complex behavior de- 
scribed in the preceding chapter and the second is that we ignore the n dependence of the 
spring constants. The first assumption is not of fundamental importance because it amounts 
to truncating the anharmonic effects in the probabilities; relaxing it would only force us to 
compute integrals numerically, slowing the computation down. 

The second assumption is even easier to relax because the n dependence will just change the 
variance of the different terms in the sum in the next equation. 

In general we believe that what is most important here is to have a general idea of what can 
and cannot be done with the spring constants set at realistic values for today's experiments. 
We will show that even without complex polymers and n dependence we cannot investigate 
the sequence at a single base level. 

We define a function u(L\B) as the equilibrium average displacement of one of the beads from 
the center of its optical trap. L is the distance between the traps and is a parameter of the 
experiment and B denotes the sequence. The dependence on B will be omitted from now on. 
The function u{L) has an explicit expression in terms of goin), that is: 

^^^^ = zlB)i^^^~ ""^h.k^ + tk + k^k 

, , (5.11) 

exp (- E • - - my) 



X 



\/kik2 + kik + k2k 



where ki = 0.025 nm~^and /c2 = 0.125 nm~^ are the spring constants of the traps; k = 0.025 
nm~^ is the spring constant of the linkers and the open part of the DNA and may depend 
weakly on n; N is the total number of bases. / = 1 nm is the difference in length when a base 
is open (two ssDNA bases, one for each side), go is the binding energy of the DNA and it's 



given in table 4.1 



For any given value of n, the number of open basis there is a characteristic length of the 



fluctuations of u, which corresponds to the width of the gaussian in (5.11). This length is 
given by: 

the reader should note that spring constants are expressed so that energies are dimensionless, 
that is as A; = /3k where f3 is the inverse temperature and k a spring constant in the conven- 
tional units. 

In the following (unless otherwise noted), 6 = 9.38 as it was calculated from realistic constants 
from Bockelmann's experiment as described in (Barbieri 09] . Other references use different 
setups that yield different numeric values: Woodside et al. [Woodside 06b] have a setup that 



would corresponds to 6 = 6.46 in the same approximation. Huguet et al. jHuguet TOl Supple- 
mentary material] have b = 8.49 for their setup. 

In figure 15. 2| we show two sequences and their corresponding free energy landscape at fixed L 
and the u{L). The reader should note how for a fixed L, values of n as far apart as 60 bases 
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can be visited with non negligible probability, and most of the times there exist two or more 
values as far apart as 20 bases which have a high probability of being visited. 
If we now define a trial function which depends linearly on a set of coefficients cf. 

M 

gtrial{n\ci) = '^Ci^lb'in - h'i) , (5.13) 
1=1 

where Q,y is some one-dimensional function of width 6', which we do not need to specify now 
to keep the discussion as general as possible. We should discuss in the following how b' is 
related to h. 

We can also define Utriai which is u where has been substituted by Atrial, and the cost 
function: 

, Af+Mo 

C{ci) = ^ Yl (^(^^^) - ^trUibl)? : (5.14) 

i=Mo 

where Mq = minb^b'[gQ{b, b')]/kcs and Mq + M = iiiaxb,b'[go{b, b')]/kef[ + N. This amounts to 
taking a measure every bl in the interval where there could be some effect from the sequence, 
for larger (smaller) i all the bases will be closed (open). 

The objective in defining this is to find the set of Cj that approximates the best a set of 
experimental measures. 

inmbfi'[go{b, b')]/ kes = 93.28 nm, and mELXbfi'[go{b,b')]/kes = 343.2 nm for the set of param- 
eters specified previously. The reader should notice that the difference between these two 
numbers is rather large compared to the size of one open base pair (1 nm). 
In effect most of the times we take many more measures than it is necessary, because for 
a given sequence the central limit theorem says it is unlikely that such extremes are ever 
reached, on the contrary the relative fluctuations of the size of the interval of interesting L 
will scale as 1/^/N. 

However, this is not a big computational problem because the computation time will not 
depend as much on the number of measures, as on the number of parameters (the Cj) which 
is fixed. On the other hand taking measures in where the response of the system is purely 
elastic does not change the landscape over which we are optimizing. 
It is now possible to minimize the cost function over the q. 

We will now show some results we have obtained for a random sequence of 50 base pairs and 
i^b{x) = 9{x + b/2)6{b/2 — x) is the boxcar function of width b. There is very good agreement 
between u and Utriab but if we plot go and Atrial the agreement is less good. At some points 
consecutive values of Cj, that is Cj and Cj+i, wander off to values which make it differ greatly 
from gQ. To quantify the difference between go and Atrial ("- 1 Cj) we can define another cost 
function: 

N 

D{Ci) = ^(5^o(?^) - 9trml{n\Ci)f ■ (5.15) 
n 

It now seems natural to define the set of parameters that minimize this new cost function 



as di and compare the Atrial ("- 1 q) and Atrial ('^ I ^^i) as we do in figure 5.4 Where 5triai('^Mi) is 
given by: 

N 

gtriaiinldi) = '^di^b'in - b'i) , (5.16) 

i 
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Figure 5.2: Two different random sequences. On top the goin). In tlie center tlie free energy 
defined as w{n, L) = go{j) + 2{kik2+klk+k2k) ~ "'0^ as a function of n for L = 270, the 
horizontal line marks the energy level E such as exp{— f3{E — Eq)) = 0.01, that is sites that are 
visited (at equilibrium) one hundredth of the time the lowest energy site is. On the bottom 
the u{L). 
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Figure 5.3: u{L) (blue) and titriai(-^ (violet) 


300 




Figure 5.4: gQ{n) (blue), 9triai(«|ci) (violet) and Atrial (n|<ii) (brown) 
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Figure 5.5: u{L) (blue) and Utriai(-^|c^i) (violet) 

In practice this amounts to the average of go{n) over the step of the trial function, in fact for 
a given step we have to minimize Yl\^='\ib-b/2] ~ 9o{j))'^j that is: 

1 ^ 

— 

where \uji\ is the cardinality of cjj, the number of bases that make up a step (it can take either 
[6] or [b\ as values). 

This way we have shown that gtri3i{n\di) is a box average of go{n) which is very different from 
a moving average, and since the Atrial ("- 1 q) has the exact same structure it makes sense to 
compare the two. 



One might also want to know how Utriai(-^^l'^i) compares to u{L). We can see that in figure 5.5 
and the agreement is definitely worse than what it was than when the fit was obtained witht 
the cost function C. 

5.2.1 Prior 



As one can see in figure 5.4 two adjacent steps can sometimes grow in opposite directions to 
non-physical values. 

To avoid this kind of problems we have added a prior to center the values of the steps around 
the average: 



^ M+Mo N 
C^ici) = 2 X] iu{ibl) - Utna.l{ibl)f + 7X^(gtrial(n|Q) - ^o)^ , (5-18) 
i=Mo n 
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Where 7 is a constant we use to increase or decrease the effect of the prior. Ideally we hope 
to obtain a reasonable fit for values of 7 smaller than the biggest eigenvalue of the Hessian of 
C when derived with respect to the Cj's. 

The problem is that sometimes we find no good fit no matter the value of 7. This is shown 



in figure (5.6), as the reader can easily see, the best fit for C does not coincide with the best 
fit for D. A decreasing as a function of 7 indicates that the best fit is dominated by the 
prior. 

prior, to put it in other words: the best fit is the trivial one: q = for all i. 

Some other times we have a non trivial minumum over 7, and things look definitely better as 



in figure (5.7). 

We have tried a prior that would take into account that the potential go can only take 10 
values, so we have chosen the form: 



M 10 



■1-3 

where the gj are the ten possible values that go can take. It is important to note that this 
strategy makes sens only when the trial function has a stepsize of exactly one. 
What we have realized is that when h has reasonable values, around those of current state of 
the art experiments (~ 10), this strategy yields no advantage over the prior we have tried in 
the preceding section. 

At the same time one might think that, for smaller values of 6, say when it's closer to one, 
this prior might help us reconstruct the original sequence, but the reconstruction is actually 
just as good. 

We have yet to find a regime in which this prior makes a difference. 

In conclusion we have found that most of the times a small value of 7 [i. e. 10~^) gives pretty 
good results, otherwise there are very clear signs that the fit has not converged. 

5.2.2 Optimal value of the step-size 

The question is whether this can be further ameliorated by choosing a smaller stepsize. If we 
chose a stepsize h' = 6/2 we obtain the best fit for 7 = 0.000399 and a value of D/N of 0.316, 



while the di yield D/N = 0.26. The results are shown in figure 5.8 

If we further decrease the stepsize to h/A there is not much to be gained: for 7 = 0.00016 we 
obtain D/N = 0.343 which is larger than what we obtained for h/2 while the value for the di 



has further decreased to D/N = 0.16. The results are displayed in figure 5.9 

We now wish to study more systematically the optimal value of 6', to do so we have computed 

the optimal Cj and di for 100 random sequences of 100 base pairs. The results are shown in 



figure 5.10 D{di) gets better and better with smaller stepsize and for h' = b/8 ~ 1.17 it is 
close to zero. On the other hand D{ci) seems to taper off to a value of approximately 0.4A'^. 
We can now define another function we can use to evaluate the goodness of fit: 



N 



E{Ci\di) = ^ (atrial 1 Ci) - Atrial 1 rfj))^ • (5-20) 
n 

E can be thought of as the distance between the fit of the u (ci) and the boxed average {di), 
which is the best attainable fit for a give step-size. 

We expect E to have a non trivial minimum where D(ci) starts to saturate, representing the 
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Figure 5.6: The top two panels show the value of the cost functions C and D ss, a, function of 
varying 7. The bottom panel shows the Atrial (^^ I q) for 7 = 0.0158 (brown), the real (blue) 
and the Atrial ('^ I ^i) (purple). The value of D/N for the di is 0.399. 
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20 40 60 80 100 

Figure 5.7: The top two panels show the value of the cost functions C and D as a function of 
varying 7. The bottom panel shows the Atrial ('i- 1 q) for 7 = 0.0063 (brown), the real go (blue) 
and the gtnai{n'\di) (purple). The value of D for the di is 0.387. 
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Figure 5.11: E{ci\di) as a function of b' averaged over 100 random sequences, the error-bars 
are the standard deviation of the mean 



point where the fit obtained through the u is closest to the average over the steps. The results 



are shown in figure 5.11 



This kind of metric can be a good gauge of what would happen when b is smaller, we have 6's 
which are a half and a quarter of the original. We have obtained this by making I respectively 
twice and four times as long. 

The results for several b' and b = 4.69 are shown in figure 5.12 md the results for b = 2.35 in 



figure 5.13 Please note that we have excluded points where b' would have been less than one. 
We also include the minimum of the average of D and E over 100 sequences obtained for a 



given b, regardless of the value of b' that corresponds to it in figure 5.14 



5.2.3 Comparison with the moving average 

This part stems from the observation that the Atrial I Cj) when b' = 1 looks a lot like a 
smoothed version of the go{n) we have thus defined ga{n) a Gaussian filter as the convolution 
product between the gQ{n) and a Gaussian kernel of width a. 
We then look for the a that minimizes the following cost function: 



N 



F{ci,a) = ^(5(^(n) - Atrial I Ci)^ , 
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0.0 0.2 0.4 0.6 O.S 1.0 1.2 ft 0.0 0.2 0.4 0.6 0.8 1.0 1.2 /, 

Figure 5.12: Value of cost functions when b = 4.69. The cost functions are shown as a function 
of b' averaged over 100 random sequences, the error-bars are the standard deviation of the 
mean. Left: D{ci) (blue) and D{di) (purple) . Right: E{ci\di) 
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Figure 5.13: Value of cost functions when b = 2.35. The cost functions are shown as a function 
of b' averaged over 100 random sequences, the error-bars are the standard deviation of the 
mean. Left: D{ci) (blue) and D{di) (purple) . Right: E{ci\di) 

D/N E/N 
0.5 r 0.5 r 




Figure 5.14: Value of the cost functions D (left) and E (right) for different values of b. The 
plotted value is the minimum of the average over b' (see figures 5.10, 5.11, 5.12 5.13 Error 
bars are standard deviations over 100 sequences. 



94 



5.2. PERFECT AVERAGES ALGORITHM 



CHAPTER 5. INFERRING THE DNA SEQUENCE 




where the Cj are, as usual, the set of parameter that minimize the C cost function. 
After several runs we have found that the optimal value of a is roughly increasing with 
increasing b, but that different sequences can lead to quite different optimal a. One would 
expect a to be linearly related to the optimal b' , but there is too much of a sequence dependence 



to conclude that. Two examples are shown in figure 5.15 



5.2.4 Difference with the naive prediction 

One possible way to perform inference on through the measurement of u{L) at equilibrium is 



to approximate the expression in equation (5.11) through a saddle point. That is to say we 



find the base n* that has the biggest contribution for a given length L and neglige all other 
contributions. 



N 

u' " ' 



(L) = u{n, L)P{n, L) ~ u{n\L) , (5.22) 



where P{n,L) is the exponential in equation (5.11), and u{n,L) = k(,s/ki{L — nl) 



Now, n* is given by maximising P{n, L), by solving: 

go{n*) = kilu{n* , L) , (5.23) 

and this equation looks as though we could use it to infer the goin*) through the value of 
u{n*, L) which should be close to u{L). The point where all this doesn't add up is the choice 
of a suitable L: we'd like to find L{n*) to know which L contributes the most to a given n* . 
To do so we solve: 

g(,{n*) = k,s{L-n*l). (5.24) 

Ideally, we'd like the solution of this to depend strongly on n*, but not through go{n*) which 
is unknown, what we find instead is that with current state of the art experiments goi^*) / k^^ 
is two orders of magnitude larger than /. 

This means that the Lo(n*) that solves this equation is not a nice, linear function of n*, but 
instead depends very strongly on the sequence. This translates itself into a wild n*-dependent 
dephasing between the nai'f prediction and the Gaussian average of the sequence. For short 
(e.g. 100 bases) sequences this dephasing effect is dominant. 
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What this really points to is that the saddle point approximation is not suitable for a case 
where k^s is so small, since it should, in principle, diverge. 

However if we take a constant shift Lq = go/kes we can show how bad this technique is 
compared to what we have proposed, by comparing it to a Gaussian runnin average with 
a = 25 (which is much bigger than the ~ 2 we have used before). The results are shown in 
figure |5.16[ 

This technique can be used to extract rapidly the average of the sequence on ~ 25 bases on 
long sequences, since it is much faster than what we have proposed. 



5.2.5 Scaling of computational time as a function of sequence length 

As of now the algorithm scales as the cube of the number of basis. In principle it is possible to 
split a long sequence in smaller batches, and fit the separately, but some practical problems 
must be adressed. 

First of all we have to take into account what we have discussed in the previson section, that 
is: it is impossible to know the relationship between L and n without knowing the go with a 
sufficient degree of precision. 

So suppose we want to fit a section of the u{L) curve, say from Li to L2, it is impossible 
to say what are the n's that correspond to that interval with precision and we may end up 
adding a few hundreds left and right just to be sure, thus killing any advantage we might have 
had splitting unless the sequence is some 40 kbp long. 

And here is where the second problem comes into play: up to now we have considered k to 
be roughly constant, but k really depends on n albeit weakly. A change in n of the order of 1 
kbp on the other hand would not be negligible anymore and would lower the value of k, and 
thus of fceff of an order of magnitude. 

This is currently a limitation of all current single molecule experiments, whenever opening 
too long a molecule the linkers become too elastic to yield meaningful insights on the go. 



In figure 5.17 we display a fit of a sequence 300 bp long with b = 9.38 and b' = b/2. This 
computation takes Mathematica 7 a little more than 20 minutes on a Intel core 2 processor 
and uses up, about 1.5 GB of RAM. It involves a search in a 64-dimensional parameter space. 



5.2.6 Estimation of the error bars 

In least squares fitting it is costumary to estimate the variance of the variables through the 



lie liiiiiiiiiuiii. ucu ^J'ij — ■ 

Then the variances are given by 



Hessian of the cost function at the minimum. Let Hij = ^.q^. calculated at the minimum. 



2 



cr\H)T^\ (5.25) 



where cj^ is the true residual variance, which is unkown, but is usually estimated as C* /n, 
where C* is the value of the cost function at the minimum and n is the number of degrees of 
freedom. 

If we do so without taking into account the prior H is not positive definite and we end up 
with negative variances. Because of this we use the full cost function with the prior. Three 



examples of the results is shown in figure 5.18 



The reader should note how for an unchanged b, there is not much gain in lowering b' . On 
the other hand when b is smaller the fit is much better and this is refiected in the error-bars. 
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20 40 60 80 100 

Figure 5.16: Here we show two different sequences 500 bases long and one which is only 100 
bases long. The naif estimate u(L + Lo)fci (violet) is compared to a Gaussian running average 
with a = 25. On the 100 bases long sequence we compare the real sequence (blue), to the fit 
obtained for b' = 1 (violet), the moving average with a = 5.65 (brown) and the nai'f prediction 
(bright green) 



5.2. PERFECT AVERAGES ALGORITHM 



97 



CHAPTER 5. INFERRING THE DNA SEQUENCE 




98 



5.2. PERFECT AVERAGES ALGORITHM 



CHAPTER 5. INFERRING THE DNA SEQUENCE 



5 r 





Figure 5.18: For the three panels: go in blue. The other three curves are the gtriah and the 
Atrial i cj. For the top panel we have6 = 9.38, b' = 6/2, = 100. 7 = 0.1. On the bottom 
right we have changed the value of b' to 6/4 and 7 = 10~^. On the bottom right we have 
changed / to nm so that 6 = 2.35 and we have kept 6' = 6, the fit is obtained for 7 = 10""^. 
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5.2.7 Entropy 

Let us suppose we consider the cost functions we have defined in the preceding sections as 
thermodynamic quantities. As such we can draw a physical analogy and estimate the number 
of sequences that correspond to a given value of the cost function through the entropy. 
By defining a pseudo-temperature we can define the partition function as: 





exp 








exp 






E 
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{bn} 
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(5.26) 



Ei 21a;.- 1 iAXiCi 



Xexp 

{6n} 



1^ 2^ TTi Xigoipn, On+l) 



{bn} " 



. Q6/(n-6'i) 

lA . — . Xigo{bn, bn+i) 



where A = Y[i(.^'^\^i\)~^ ■ 

Using the transfer-matrix method we can recognise exp 

matrix which appears \LOi\ times identical and then invo' 
Because of this we can rewrite the previous equation as: 



. flu(n—b'i) ru u \ 
lA |^.| ' XigQ{bn, On+l) 

ves a different Xi. 



as a 4 X 4 



Z = A I dxe ^^2Ki^? iA^^^i 



M 



X bo-\[[T(iXxi)/\ui\)t^\-bN. 

bo,bN i 



(5.27) 



where: 



/ 5.93* 4.71* 12.43* 9.21* \ 

rr u\ r fi, u 2.89* 5.93* 9.78* 12.68* 

T6„,6„+i(t)=exp[5o(6n,6n+i)tJ= ^2.68* 9.21* 23.1* 46.99* 

\ 9.78* 12.43* 49.4* 23.1* / 

By rearranging the terms one can decouple the integrals 

M 



(5.28) 



Z = AY,bo-f{ Jdxie-^' 2ik-l-^^-i-^ [TiiXxi)/\ui 
bo,6jv i 



■b 



N : 



(5.29) 



Once we have computed the one dimensional integrals we can multiply the N matrices and 
sum over the first and last base. 
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Figure 5.19: Entropy for a sequence of 26 bases (25 values of go) and 5 measures. The 
entropy is computed for q = 2.52 for every i. At high energies the entropy saturates to 
S{oo) = 261og(4) ~ 36, the right value which is the logarithm of the number of possible 
sequences of 26 bases. 



We can then change variable (/3 = A^/2) and compute the thermodynamic quantities as: 



d_ 
1 



logZ 



/3 



logZ 



5(E) = max(/3(i?-F(/^))) 



(5.30) 

(5.31) 
(5.32) 



The entropy as a function of internal energy can be also obtained with a parametric plot, as 
it's shown in figure 5.19 Even through the simplifications obtained thanks to the transfer 



matrix method, the computation of entropy is very taxing and we didn't have enough memory 
for computing the entropy for cases where the Cj are not all identical or for longer sequences. 



5.2.8 A different approach 

In this section we will expound a different approach for tackling the same problem as developed 
by Jorg, Monasson and Cocco and is yet to be published. 

The main difference is that this formalism allows for the description of the same system in a 
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vector space and defines measures as orthogonality constraints. 

This formahsm has also allowed Jorg et al. to compute interesting statistical properties of 
this system, but we will not dwell on the details here. 



Equation (5.11) can be rewritten by multiplying both sides by Z{B): 
^Vn{B)[a{L - nl) - l3u{L)]eyi^ -'^{L - nl) 



0, 



(5.33) 



where 



and 



Vn{B) = exp 



a 



kk2 



^/{klk2 + kki + kk2)'^ 



^/klk2 + kki + kk2 



1 1 1 

k ki k2 



(5.34) 

(5.35) 
(5.36) 

(5.37) 



This equation only makes sense if we use for the u{L) the measures we have obtained from 
an experiment. 

Equation (5.33) can be rewritten as: 



(5.38) 



where Pn{L) = [a{L — nl) — /3n(L)] exp [— f (-Z^ — n/)^] is a vector that depends on a on a given 
measure n(L, B). 

This can be thought of as the scalar product v{B) ■ p{L), suggesting a geometrical interpre- 
tation: we have to choose the sequence B so that it is orthogonal to all the vectors given by 
the measures encoded by the vectors p{L) for different values of L. 

The problem of finding the optimal vector v{B) can be rephrased as a minimization problem 
over a quadratic form by squaring both sides: 



^VniB)pniL) 



Vm{B)pm{L)Vn{B)pn{L) 



Y,v^{B)KmAL)vn{B) = v^{B)K{L)v{B) = 0. 



(5.39) 



where Km,n{L) = pm{L)pn{L). Different measures are easily taken into account by adding 
the terms obtained for different L's: 



M 



v\B) 



v{B) = Q, 



(5.40) 



J=i 



where Wi are arbitrary positive weights. 
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5.3 Dynamical algorithm 

5.3.1 A toy model: coupled Ornstein-Uhlenbeck processes 

Real unzipping measurements do not grant us access to the instantaneous force (or displace- 
ment) signal. What is actually measured is a signal which is time averaged over a period of 
a few milliseconds. 

In this section we wish to explore the effects of time averaging on a simple stochastic sys- 
tem. We will compute the probability of observing a series of time averages given a set of 
parameters and thanks to the Bayes theorem we will be able to chose the most likely set of 
parameters given a set of measures. 

Let us consider an Ornstein-Uhlembeck process [Uhlenbeck 30] : 

'yx = —k{x — y) + r] , (5-41) 

where r/ is a Gaussian noise with zero mean and variance {'il{t)r]{t')) = 2kBTj6{t — t'). 

We wish to consider its temporal average x over a certain time and to infer from it the physical 

quantities 7 and k. 

The solution of the model is well known and it's the stochastic function: 

a;(0 = xoe"7* + 2/(1 - e"7*) + - /" dt'e'^^^'^^'^rjit') . (5.42) 

7 Jo 

That is a Gaussian process with mean and variance given by: 

(j;(t)) = xoe"^* + y(l - e"^*) , (5.43) 

{x{tf) - {x{t)f = ^ (1 - e-'^*) . (5.44) 

If we now consider the time average over a time t of the same stochastic function we obtain 
another stochastic function of the form: 

x{t) = ^ ^ dt'x{t') = Yt^xo- y){l - e"^*) + y 



+ 1 / dt' [ dt"e-^^''~''\{t") 
7* Jo Jo 

That is a Gaussian process with mean and variance: 



(5.45) 



(x) = ^(xo-y)(l-e ^*) + y, (5.46) 



{x{ty) - {x{t)y = + (^-3 + 4e-^^ - e-T^j , (5.47) 

and additionaly we should consider: 

{x{tMt)) - (mm) = ^ (1 - e-^*)' . (5.48) 

All this can be summarized defining a covariance matrix as a function of a dimensionless time 
r = kt/^: 



ksT I 1 - 



^ ^ ^ + ^ (-3 + 4e- - 



(5.49) 
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Figure 5.20: The evolution of the Ornstein-Uhlenbeck process during a time step and its 
average. 



But since the process is Gaussian we can write the full probability starting from the means 
vector and the covariance matrix: 

P{x{t),x{t)\xo) = exp (^-h^C'^x^ , (5.50) 

where x = ( ''i^^} \'^['^}} I and is the inverse of C, that is: 
V - {xit)) J 



keT {t {1 + e-^) - 2 {1 - e-^)) 

-r(l-e-) r2(l + e-) J 

What we have just wrote defines the evolution of the system through an amount of time t; 
let us now just suppose that this is just a step in the evolution of the sistem, that is, at time 
(/ — l)At the system is in xi-i and it evolves to xi in lAt as shown in figure 5.20 In this time 
interval its time average is defined as: 

rlAt 



XI = I dt'xit') . (5.52) 
■/{l-l)At 



If we set: xi-i = xq, xi = x{t), xi = x{t) and r = kAt/j we can recycle the previous 
expression to define a propagator: 

P{xi,xi\xi^i] 



2nVd^ (5 53) 

xexp(-l(x.-(x.),..-(..))C-(^;:|^;|)). 

So, as long as the Ornstein-Uhlenbeck process is a Markov process, we can write the joint 
probability of the process as: 

L 

P{{xi,xi}iLi\xo) = '[lP{xi,xi\xi-i) . (5.54) 
1=1 



This is pictured in figure 5.21 and can easily rewritten as a single exponential: 

P(fe.^,}fc.l^.)= ^^^J^,, exp(-^Q) , (5^55) 
104 5.3. DYNAMICAL ALGORITHM 



CHAPTER 5. INFERRING THE DNA SEQUENCE 




where Q is: 



1=1 



Axi + Bxixi-i + Cxi_i - D{xi + xi-i)xi - {xi - xi-i)y - rxiy 

21 



(5.56) 



where: 
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(5.57) 
(5.58) 
(5.59) 
(5.60) 
(5.61) 



We would hke now to integrate out the Xj's in order to obtain the joint probabihty distribution 
for the time averages only: 



/oo 
W^dxiP{xuxi\xi-i) 



(5.62) 



In order to perform this integral in full generality we need to change variables in order to 

diagonalize the quadratic form Q and factorize the integrals. 

Q can be diagonalised by a discrete Fourier tranform, provided we force periodic boundary 
conditions (by imposing xq = xl), that is: 



1 



= ^ ^ x^e L 
1=1 



2niql 



Xl 



2-Kiql 

e i 
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the choice of prefactors ensures the unitarity of the transformation which is known to be 
orthogonal. Q is thus transformed into: 



9=1 



A + C + Bcos 



+ EXl 



VLtuXq + Lt- 



Thus integrating over the Xq yields: 



Xf-D{l + cos 



XqXq 



(5.65) 
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y 



(5.66) 



We can now use Bayes' theorem to interpret the probability in eq (5.62) as the likelihood of 
a set of measures being generated by a given r. 

With standard computational techniques one can compute the log-likelihood in time O(L^). 
In figure 5.22 we show the results for simulated runs of different lengths. It is easily shown 
how the prediction of r improves with more points, but it's already reasonably good with 
only 200 points. 

One could also use the width of this curve to compute k/{k-QT) and by knowing the value of 
the temperature compute 7 and k. 

What is compelling about this algorithm is that we are exploiting all the information available: 
fluctuations, correlations and not only the averages. 
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Figure 5.22: The log-likelihood as a function of r for different L (indicated in the legend), 
in red we show the actual value of r that generated the data. The log-likelihoods have been 
offset by a constant value and changed into its opposite for cosmetic reasons. The minimum 
of the displayed curve is the most likely value of r. 
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Infotaxis 

In the part devoted to infotaxis we have devefoped a continuous version of the algorithm and 
analyzed its behavior and performance in two and three dimensions. 

We have shown the probability of success not to depend on the distance from the source when 
this latter is of the order of magnitude of A, the characteristic length of the odor advection 
phenomenon. 

Furthermore we have shown the search time to grow with 7, the parameter that regulates 
the speed of the searcher in response to the gradient. However, we have observed simulations 
with small 7 to be computationally more taxing. 

The computational time needed to perform a single step is still very large: this is due to the 
need of computing many Monte Carlo integrations over the whole space, but also from the 
strategy we have chosen that increases the time complexity of the algorithm to O(i^). 
In order to bring back the complexity of the algorithm to 0{t) we have toyed with finite 
memory, as in forgetting earlier events, but this has the effect of removing the exponential 
term that discounted the probability at the starting point and at the early hits. This is to be 
avoided because it will attract the searcher very strongly back to where it started. 
We think the solution to this is to coarse-grain past events by decimating older events and 
increasing the weight of the points left. This could leave us with a constant number of points 
and a precision in integration that's only slightly reduced. We think that the coarse-graining 
could be performed on the fly according to the position of those points compared with the 
most recent position of the searcher. 

Another exciting new direction we think could be explored is to think infotaxis as the first 
and simplest strategy in the class of those based on information theory: infotaxis performs 
choices by looking at the immediate next step, what would happen if we looked several steps 
ahead? 

Such a strategy would translate to adding higher derivatives to the differential equation that 
regulates the movement of the searcher: the first such step adding an inertial term: 



This inertial term with a mass tensor proportional to the local curvature of the potential at 
the position of the searcher could have beneficial effects to the performance of the searcher. 
Finally we think another interesting direction to take would be to build a meta-heuristic 
for searches that mimics the behavior we have observed in infotaxis without the need of 
performing the full entropy calculation. For example we could rethink a technique such as 
the one developed in [Balkovsky 02 to work in continuous space and three dimensions. 
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DNA unzipping and sequencing 

In chapter [S] we have shown several approaches to the inference of DNA sequencing through 
micromanipulation experiments. The first issue that stands between us and a successful al- 
gorithm is the fact that the number of open bases is not directly known, but acts as a hidden 
variable, while the position of the bead can be measured directly; the second problem is that 
the temporal resolution in experiments is very low compared to the time-scale of the opening 
and closing of the fork. 

The second section of this chapter deals with the first problem: the fact that the fork posi- 
tion n is unknown. It does so in a limit which is not completely realistic by imposing that 
equilibrium is perfectly attained and that we can sample the equilibrium distribution up to 
an arbitrary precision. In a real experiment there will be many sources of noise and if we 
take averages for a long enough time we will end up measuring drifts in temperature and trap 
position which will change the equilibrium distribution. 

The approach of the third section, on the other hand takes into account the fact that we 
could in principle be out of equilibrium and that an infinite sampling frequency is out of the 
question, but it does so by relying on a very simple model, arguably the simplest non-trivial 
stochastic process in continuous time and space. 

In order to devise a more realistic algorithm we should combine this two approaches, but a few 
difficulties stand in our way: suppose we took the dynamic approach we have used with the 
Ornstein-Uhlenbeck and tried to apply it to a more complicated system, our experience tells 
us that even relaxing the periodic boundary conditions in time makes it hard to diagonalize 
the covariance matrix analytically. 

On the other hand we could try to adapt the idea developed for the perfect averages approach 
and use them in conjunction with the dynamic algorithm: we could describe the potential 
on the hidden variable by a simple potential that depends only on a few parameters, that 
can in turn be fitted, but it is hard to say how a numeric approach can be combined to the 
dynamical algorithm. 

Ultimately we think that many improvements can be brought into play for the experimental 
procedure if one bears in mind sequencing by unzipping as the ultimate goal. One example 
are advances in manipulation techniques through holography, allow for the manipulation of 
multiple beads with a single laser beam [Curtis 02j which could allow for the simultaneous 
rotation of complex objects. Setups similar to a microscopic bobbin or spindle could one day 
become feasible if one could find a way to prevent ssDNA from forming secondary structures 
when confined. 

Another idea suggested to us by Prof. A. Libchaber is the use of proteins that bind to ss- 
DNA stiffening it, bringing us somewhat closer to the measurement of the actual fork position. 
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Abstract 

We present a dynamical model of DNA mechanical unzipping under the action of a force. The 
model includes the motion of a fork in a sequence-dependent landscape, the trap(s) acting on 
the bead(s) and the polymeric components of the molecular construction (unzipped single 
strands of DNA and linkers). Different setups are considered to test the model, and the 
outcome of the simulations is compared to simpler dynamical models existing in the literature 
where polymers are assumed to be at equilibrium. 



1. Introduction 

Over the past 15 years, various single molecule experiments 
have investigated DNA mechanical and structural properties 
[1-18] and protein-DNA interactions [19-29]. These 
experiments provide dynamical information usually hidden 
in large-scale bulk experiments, such as fluctuations on the 
scale of the individual molecule. The separation of the 
two strands of a DNA molecule under a mechanical stress, 
usually referred to as unzipping, was first carried out by 
Bockelmann and Heslot in 1997 [8]. The strands are pulled 
apart at a constant velocity while the force necessary for the 
opening is measured. The average opening force for the 
A-phage sequence is about 15 pN (at room temperature and 
standard ionic conditions), with fluctuations around this value 
that depend on the particular sequence content. Bockelmann, 
Heslot and collaborators have shown that the force signal is 
correlated to the average sequence on the scale of ten base 
pairs but could be affected by the mutation of one base pair 
(bp) adequately located along the sequence [10]. Liphardt 
et al [15] and Danilowicz et al [16-18] have performed an 
analogous experiment, using a constant force setup, on a short 
RNA and long DNA molecules respectively (figure 1(B)). The 
distance between the two strand extremities is measured as 
a function of the time while the molecule is submitted to 



a constant force. The separation of DNA strands has also 
been studied in single molecule experiments by translocation 
through nanopores [26, 27]. 

The motivation underlying unzipping experiments of 
DNA is (at least) twofold. First, the study of unzipping 
aims at a better understanding of the mechanisms governing 
the opening of DNA during transcription and replication by 
proteins such as polymerases, helicases and exonucleases 
[20, 21, 28, 29]. Simple theoretical models describing the 
opening as an unidimensional random walk on a sequence- 
dependent free energy landscape have been proved to describe 
quite well several experimental effects such as stick-slip 
motion in the opening at constant velocity [9, 10], the long 
pauses at a fixed position of unzipping at constant forces 
[16, 30, 31], the hopping dynamics between two or more 
states in unzipping at critical forces of short DNA molecules 
[15,3 1-33] and the torsional drag effects in unzipping at large 
velocity [11, 34]. Moreover, statistical mechanical analyses 
have been successfully applied to extract from experimental 
data the sequence-dependent free energy landscape and the 
height of free energy barriers [35, 36]. 

Second, unzipping experiments could potentially be 
useful to extract information on the sequence itself [37]. 
Recently, single molecule sequencing has been achieved 
by monitoring a DNA/RNA polymerase in the course of 
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Figure 1. Typical experimental setups that will be described in the 
following. (A) A setup with two optical traps (beads xi and X4) 
drawn as springs and whose centers are the black vertical lines and 
(B) a setup with a single magnetic bead .1:3 that applies a constant 
force to the molecule attached to a fixed 'wall'. In both cases, the 
molecular construction is made by a DNA molecule that has to be 
opened (therefore, one should include two single-strand linkers that 
are the opened parts of the molecule) and one double-stranded DNA 
linker The coordinates x, are the distances of the corresponding 
points from the left reference position (which is the center of the left 
optical trap in case (A) and the fixed wall in case (£)). 

DNA synthesis from a ssDNA template [33, 38]; such 
single molecule sequencing could become competitive with 
standard DNA sequencing because they do not require, 
a priori, amplification through polymerase chain reactions. 
A fundamental question on the possibility of extracting 
information on the sequence from unzipping experiments 
is the influence of the expeiimental setup on the measures 
and the limitations imposed by the latter [37, 39]. Indeed, 
characteristic spatio-temporal limitations are the finite rates of 
data acquisition, the relaxation time of the bead, the limited 
spatial resolution, the thermal drift and more generally the 
noise in the instruments. Moreover, the dynamics of the 
opening fork (figure 1) is influenced by the single strands 
(open parts) of the molecule and the linkers, and cannot be 
deduced directly from the observation of the bead from which 
the force or the position is measured. 

The accuracy of unzipping experiments at fixed velocity 
has improved a lot over the last decade. Initially performed 
with an optical fiber [8], experiments were then based on the 
use of simple optical traps [10]. Nowadays, double optical 
traps [13, 36] allow us to considerably reduce the drift of the 
setup and to achieve a temporal resolution of the order of 
10 kHz, a sub-nanometric spatial resolution, and a precision 
on measured forces of the order of fraction of pN. Unzipping 
at fixed force has been performed by a magnetic trap with a 
low temporal resolution (from 60 Hz to 200 Hz) due to the 
time needed to extract the position of the bead, the spatial 
precision being of the order of 10 nm Hz~'/^ [28, 29], or 
by an optical trap also with a low temporal resolution (about 
10 Hz) imposed by a feedback mechanism needed to keep the 
force constant [15]. Recently, a new dumbbell dual optical 
trap has been developed. It operates without feedback and can 



maintain the force constant over distances of about 50 nm [33] 
with a temporal resolution of 10 kHz and a spatial resolution 
of 0.1 nm Hz-'/2. 

Limitations due to the experimental systems were first 
addressed in [39]. This paper stated the impossibility of 
inferring the sequence due to ssDNA fluctuations: fluctuations 
increase with the number of opened base pairs and can 
become larger than the length of about 1 nm corresponding 
to the spatial resolution of one open base pair. This problem 
could however be solved by integrating out the single-strand 
dynamical fluctuations. Several works have studied the effects 
of the setup on the hopping dynamics of small RNA molecules 
[32, 33, 39, 40]. The following effects have been underlined. 
First, the free energy landscape changes when adding a 
harmonic potential to the free energy, due to the bead and 
handles [10, 32, 33, 40]. Therefore, for a given force, 
the measured separation of the extremities depends on the 
stiffnesses of the trap and handles. Moreover, the opening 
and closing rates depend on the stiffness of the optical trap; 
in particular when the experimental system gets softer the 
fluctuations of the force gets smaller, and the hopping rates 
approach their fixed-force values. 

In this paper, we introduce a model for the coupled 
dynamics of the opening fork, the ssDNA strand, the linkers 
and the bead in the optical or magnetic trap. Essential 
notions and existing literature are reviewed in section 2. Our 
dynamical model is presented in section 3. Our program 
allows us to simulate a generic setup, characterized by bead 
dimensions, optical stiffness (absent in the case of magnetical 
tweezers), linker composition (dsDNA or ssDNA) and lengths, 
and the length of molecule to be unzipped. All the parameters 
that characterize the different dynamical components can 
be adjusted in the simulation. The model is then used to 
simulate fixed-force (section 4) and fixed-extension (section 5) 
numerical unzippings. 

2. Free energies, time scales and effective dynamics 

We discuss hereafter the thermodynamic properties of the 
various parts of the experimental setup (DNA sequence, open 
part of the molecule, single- or double-strand linkers), as well 
as the relevant time scales. Finally, we biiefly review previous 
dynamical studies where the linkers and the open portion of 
the molecules are assumed to be at equilibrium. 

2.1. Thermodynamics of the components 

2.1.1. Polymeric models for the linkers and open molecule. 
A polymer model is specified by its free energy as a function 
of the extension x for a given number n of monomers; we 
call this quantity W{x,n). When x and n are large, W is 
an extensive quantity; hence, W(jc, n) = nw(x/n) = nw(l), 
where I = x/n is the extension per monomer. We also define 



/(O 



dW(x,n) 
dx 



w\l). 



1( f) = inverse of/(/), 
g{f) = max[/Z - w{l)] 



(1) 



//(/) - w[l(f)l 
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which are, respectively, the force at fixed extension, the 
average extension at fixed force and the free energy at fixed 
force. Note that g( f) is simply the integral of /(/). Hence, 
a polymer model is completely described from the knowledge 
of the extension versus force characteiistic curve, /(/). In the 
following, we will use some classical models for this function. 



Gaussian (Hook) model. 

lliook(f) 



f 



(2) 



kicif) = coth 



(3) 



where the stiffness constant k'" is related to the 
temperature T and the average squared monomer length 
(at zero force) through k'" = k^T/b^. 
Freely-jointed chain (FJC) model. 

fb\ _ k^ 

MTJ fb 

is the extension (per monomer) of a chain of rigid rods of 
length b, free to rotate around each other. Comparison 
of this model with force-extension curves for single- 
stranded DNA shows that a better fit is obtained from 
a modified FJC: 

£ 

which takes into account the elasticity effects on the rod 
length. Standard fit parameters are d = 0.56 nm, b = 
1.4 nmand y,, = 800 pN. 
Extensible worm-like chain (WLC) model. 

1/2 

1 / A.R 1 \ 

1 



:(/) = rf 1 + 



X hscif). 



(4) 



/WLC(/) 



Kds 



(5) 



is the formula for the high-force extension of an elastic 
chain with persistence length equal to A. Experiments 
show that it is an excellent description of double-stranded 
DNA at high forces, with L = 0.34 nm, A = 48 nm and 
= 1000 pN. 

2.1.2. Free-energy landscape for the sequence. Let Z?; = 
A, T, C or G denote the ith base along the 5' —>■ 3' strand (the 
other strand is complementary) and B = {bi, b2, . ■ . , b^}. 
The free-energy excess when the first n bp of the molecule is 
open with respect to the closed configuration (w = 0) is [31] 



G(n; B) = ^ go(^/, fc,+i), 



(6) 



where go(bi, bj+i) is the binding energy of the bp number 
it depends on bi (pairing interactions) and on the neighboring 
bp bi+\ due to stacking interactions, go is obtained from the 
MFOLD server [41, 42], and listed in table 1 for 150 mM 
NaCl, room temperature and pH 7.5. The values of the free 
energies should be changed for different ionic conditions and 
temperatures. 

As an illustration, we plot the free energy G(w; A) of the 
first 50 bases of the >.-phage sequence, A = (ki, X2, . . . , Ajv), 
in figure 2 after subtraction of ngss(f) for forces / = 15.9 
and 16.4 pN. gss(f) is the work to stretch the two opened 
single strands when one more bp is opened, and calculated 




10 20 30 
nb. n of unzipped bp 



50 



Figure 2. Free energy G (units of k^T) to open the first n base pairs, 
for the first 50 bases of the DNA X-phage at forces 15.9 (dashed 
curve) and 16.4 pN (full curve). For / = 15.9 pN, the two minima 
at bp 1 and bp 50 are separated by a barrier of 12 k^T. Inset: 
additional barrier representing the dynamical rates (21) to go from 
base 10 to 9 (barrier equal to 2gss = 2.5 k^T) and from base 9 to 10 
(barrier equal to goihg, bio) = 3 keT); see text. 

Table 1. Binding free energies go(hi, bj+i) (units of k^T) obtained 
from the MFOLD server [4 1 , 42] for DNA at room temperature, 
pH = 7.5 and an ionic concentration of 0.15 M. The base values 
and bi+i are given by the line and column, respectively. 



80 


A 


T 


C 


G 


A 


1.78 


1.55 


2.52 


2.22 


T 


1.06 


1.78 


2.28 


2.54 


C 


2.54 


2.22 


3.14 


3.85 


G 


2.28 


2.52 


3.90 


3.14 



from the modified FJC model (4). The subtraction allows us 
to compare the increase in the free energy due to the opening 
of the sequence to the gain resulting from the release of ssDNA 
polymers at a given force. 

At these forces, the two global minima in figure 2 are 
located in n = 1 (closed state) and « = 50 (partially open 
state). Experiments on a small RNA molecule, called P5ab 
[15], have been performed at the critical force such that 
the closed state has the same free energy as the open one: 
G(N; A) = Ngss(fc)- They showed that, as the barrier 
between these two minima is not too high, the molecule 
switches between these two states; see section 2.3. 

2.2. Fluctuations at equilibrium 

2.2.1. Case of a single polymer We now consider the 
orders of magnitude of the fluctuations of the polymer. When 
submitted to a force of / = 15 pN, the average extension 
of the polymer is Jf = n.x'" with x"' = l{f). We use 
for single-stranded DNA the MFJC model, and for double- 
stranded DNA the WLC model, with the parameters discussed 



in section 2.1.1; then we get x" 



0.46 nm and 



0.33 nm for ss- and dsDNA respectively. At thermal 
equilibrium, the extension will fluctuate around these average 
values. The fluctuations are controlled by the microscopic 
effective spring constant k'" (I) = w"(l) = l/l'(f). For ds- 
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Table 2. Fluctuations of single-stranded DNA at / = 15 pN and 
T = 16.7°C; &xlx = 0.37/ V^, &flf = 1.57/ V^, 
T = 4.83 X 10-"s;!^ 



Sx/x Sf/f X (s) 



10 


0.117 


0.496 


4.8 X 10' 


-9 


40 


0.058 


0.248 


7.7 X 10" 


-8 


100 


0.037 


0.157 


4.8 X 10- 


-7 


400 


0.018 


0.078 


7.7 X 10" 


-6 


1000 


0.012 


0.050 


4.8 X 10- 


-5 



Table 3. Fluctuations of double-stranded DNA at / = 15 pN and 
r = 16.7 X; .5x/jc =0.17/v^, 5/// =4.83/V^, 
T = 5.1 X lO-'^sn^. 



n 


hxjx 


5/7/ 


T 


(s) 






100 


0.017 


0.483 


5 


1 X 


10- 


-8 


400 


0.0085 


0.241 


8 


1 X 


10- 


-7 


1000 


0.0054 


0.153 


5 


1 X 


10- 


-6 


4000 


0.0027 


0.076 


8 


1 X 


10- 


-5 


10000 


0.0017 


0.048 


5 


1 X 


10- 


-4 



and ssDNA we find, respectively, A:^" = 1311 pN nm~' and 
A:™ = 138 pN nm~' according to the above models. For a 
polymer with n monomers, the stiffness is = fc"'/n since the 
effective spring constant is given by k{x, n) = j^W(x, n) = 
k'"(xln)ln. 

Alternatively, the force / exeited on the polymer will 
fluctuate around its average value / if its extremities are kept at 
a fixed distance x from each other. These fluctuations of force 
(in the fixed-extension ensemble) and extension (in the fixed- 
force ensemble) are easily computed by a quadratic expansion 
of the free energy around the average, i.e. when approximating 
the polymer with a spring of stiffness k'" /n, with the result 

k^T 
~, — 



(Sx^) 



{Sf) = ^ 



(7) 



Defining Sx = ^{Sx^) and Sf = ^(Sf^), we get 
Sx 



kjiT 1 



SI 
f 



k^Tk'" 1 



(8) 



X V A'"(:c'»)2 V^' / y P 

As expected, the relative fluctuations of both force and 
extension become smaller and smaller as the number n of 
monomers increases. Some values are reported in tables 2 
and 3. 

2.2.2. Case of several polymers (fixed-distance setup). 
Now consider the case of several polymers, e.g. linker 
and open part of the molecule attached one after the 
other. In a fixed-force experiment, the components of 
the setup are independent (at the level of the saddle- 
point approximation) and the fluctuations in the extensions 
simply add up. In the fixed-distance setup, however, 
correlations between the extensions make the analysis more 
complicated. As a concrete example, we consider the setup in 
figure 1(A). The linker joining x\ and xj is a double-stranded 
DNA segment of A'ds bases. The two linkers joining {x2, xj,) 
and {xt,, X4) are single-stranded DNA segments of A'ss = 
+ n bases, where n is the number of opened base pairs. 



The centers of the two optical traps are at and X. We call xi 
the position of the first bead and X4 the position of the second. 
The probability P^qin, xi, X2, x^, x^) = e"^/*"^, where the 
free energy F reads as 

F(x, n) = \kix\ + Wi^{x2 - xi,Ni,) + W^,{x3 - xj, N,,) 
+ W,,(X4 - X3, N,,) + \k2{.X4 - Xf + Gin- B), (9) 

where Wi,(x,Ni,) = Na,wa,(x / N^,) and W,,(x,N,,) = 
NstiWsti(x/Nss) are the elongation free energies of the double 
strand and single strand, respectively. 

In order to study the fluctuations in this setup, we first 
find the maximum of P^q assuming that G(«; 5) = ngo, i.e. 
a uniform sequence B, and treating n as a continuous variable 
assuming that it is large. At the maximum xi = Xj and we 
define 

^'n ^ «. ^ ^ (10) 

The saddle-point condition S^iFa = gives the following 
equations, which represent the force balance condition along 
the chain: 

kixi = w'^X^^^ = w'^^{xZ) = k2{X - X4) = /. (11) 

The derivative with respect to n gives, using equations (1) and 
(11), the condition 

go = 2[<>U<') - w4x:D] = gMl (12) 
which allows us to find the force / transmitted along the 
chain. Once (12) is solved, the extensions of the beads and of 
the double- and single-stranded parts of DNA (xi, X — X4, x"^^ 
and X™ respectively) are determined by equation (11). Finally, 
the number of open bases h is determined by 

^1 + Ni,xl + 2« + n)x™ + (X- X4) = X. (13) 

Note that the value of / is determined only by go- 

We work at temperature T = 16.7°C (k^T = 4 pN nm) 
and choose a uniform molecule with go = 2.69k^T , which is 
a representative value for the pairing free energies in table 1 . 
We use the same models as in section 2.2.1 for the single- 
and double-stranded DNA, with A^ds = 3120 and Af° = 40. 
Then solving equation (12) we get / = 16.5 pN, and from 
equation (11) we get x™ = 0.47 nm, x^ = 0.33 nm. We 
choose ki = 0.1 pN nm~', then xi = 165 nm, and k2 = 
0.512 pN nm~', then X — X4 = 32 nm. Given these values, n 
is defined by X using equation (13): 
X - 1264 

(14) 



0.94 

with X expressed in nanometers. 

For the same setup, we can compute the fluctuations of 
n and of the elongations of the elements of the setup. In 
particular, the fluctuations of the bead positions are measurable 
in the experiment. 

Let us define 5x, = x, — x,- andSw = n—h. To simplify the 
formalism, we also define Sx^s = 8x2 — Sxi, Sx^^ = Sx^ — 8x2 
and SXj^^ = 8x4 — 5x3 ■ A quadratic expansion of F around its 
minimum gives 



8F 



■8xi 



1 2 1 2 "'d 

-k,8x,^-k28x,^-^^^ 
.^[(K^-xS^«)%(K^ 



ISnf]. 



(15) 
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\[x'^ = 152pNnm-i and 



Using (4) and (5), we get k'"^ = w[ 
^ds = '«ds(-«ds) = 1416pNnm-'; 

One should take care of the fact that Sxi + + Sx^f^ + 
Sx^^ + Sx^^ = 0; it is convenient to express Sx^^ as a function 
of the others since its fluctuations are identical to those of 
Sx^^. The quadratic expansion of the function SF has the form 
SF = 



^SxASx where Sx ■■ 



(Sxi, SX4, SXds, Sx^^, Ks^'^) ^'^'^ 
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The inverse of the matrix A is 



1 

" 2ki 
1 

" 2k, 







1 

" 2*2 
1 

" 2*2 



where 
1 



eff 



1 1 
— + — + 

ki kj 



0\ 




0/ 




Wd, 

1 

^eff 



(16) 



2k, 
1 

2*2 
2kZ 



' 2k, \ 

_\_ 
' 2k 

Ni 
2k 



1 



(17) 



+ 2- 



(18) 



This immediately gives 

fcBr(A-')i., = 

*:Br(A-')2,2 = 
k^T{A-%,, = 



k2 

k^TNi, 



ksTiA- 
k^T 



')4 



r(A-') 



5,5 



"-ds 

k^T 

4^eff 

k^[ 



(19) 



and shows that the fluctuations of n are dominated by the 
weakest element of the setup; moreover, the correlation 
between the bead displacements &xi, 5x^ and the fluctuations 
of the number of open base pairs 8n is {8n&xi) = — ^^°^,„ and 

{8n&x^) = — ; the stiffer the optical trap, the weaker is the 
correlation between the location of the bead and the number 
of open bases. Examples are given in table 4. 

2.3. Effective dynamical models 

In the simplest dynamical models, the fork (separating the 
open and closed portions of the molecule) undergoes a biased 
random motion in the sequence landscape. The linkers are 
treated at equilibrium, which is correct if their characteristic 
time scales are much smaller than the average time needed to 
open or close a base pair. 



Table 4. Saddle-point calculation for the setup in figure 1(A) with a 
uniform molecule and /ci =0.1 pN nm~', = 0.512 pN nm~', 
Wds = 3120, Nl = 40. The force along the molecule is / = 16.5; 
then = 152 pN nm"' , ytj^ = 1416 pN nm"' and k^^^ = 
0.07 pNnm-'. 



X 


n 


^eff 




1273 


10' 


0.067 


8.2 


1358 


10^ 


0.062 


8.5 


2204 


10^ 


0.036 


11.2 


10664 


lO'* 


0.0068 


25.7 



2.5.7. Time scales for the polymeric components of the 
setup. In this section, we recall the typical time scales of 
the polymeric components in the setup. Assume that the 
polymers are subject to a Brownian force ri(t) which is a zero- 
average Gaussian process with an autocorrelation function 
{r](t)r](0)) = 2rTS{t). Let V be the friction coefficient 
of the polymer [43], that is, the ratio of the viscous force 
exerted by the solvent to the velocity. As will be shown in 
section 3, the friction coefficient scales as F = y'"n/3 with 
= ~ 2 X 10"** pN s nm~^. Then, approximating 
f(x,n)~k'"x/n, the relaxation time for an isolated polymer 
of n bases is given by 

yin 2 

r = 4—. (20) 



3k'" 

Note that the factor 3 in the denominator of the above equation 
is an approximation for the true factor jr^/4. The validity 
of its approximation and the simplification it leads to will be 
discussed in appendix A. 

It is useful to compare the amplitude of the force 
fluctuations with the noise. To do this, we approximate 
{Sf{t)Sf(0)) ~ 2r {Sf^)S(t) = ITV f&it). Then, using 
equation (7) to estimate (5/^), we get T f = wy'"/3 = F, 
and (not surprisingly) the force fluctuations are of the same 
order as the noise term. 

From table 2, the relaxation time of the unzipped strands 
is smaller than the typical base-pair opening (or closing) time 
as long as the number « of unzipped bases is smaller than a 
few hundreds. This is the case, in particular, for unzipping 
experiments on short RNA molecules. 

2.3.2. Random walk in the sequence landscape. Let us first 
model the motion of the fork alone, that is, assuming that 
the other components of the setup are at equilibrium. We 
consider a DNA molecule unzipped under a fixed force / in 
the sequence-landscape G(n; B) — ngss(f) of figure 2. The 
fork, whose position is denoted by n(t), can move forward 
(n —>■ n + 1) or backward (« — >• n — 1) with rates (probability 
per unit of time) equal to, respectively, 

ro(bi,+i,b„+2) = r expl-Pgo(b„+i, b„+2)], 
= rexp[-2fes(/)], 
where p = \/k^T; see figure 2. The value of the attempt 
frequency r is of the order of 10^ Hz [12, 14, 31]. Expression 
(21) for the rates is derived from the following assumptions. 
First, the rates should satisfy detailed balance. Second, we 
impose that the opening rate rg depends on the binding free 



(21) 
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50 100 

time (sec) 

Figure 3. Number of open base pairs as a function of the time for 
various forces (shown in the figure). Data show one numerical 
unzipping (for each force) obtained from a Monte Carlo simulation 
of the random walk motion of the fork with rates (21). 

energy, and not on the force, and vice versa for the closing 
rate r^. This choice is motivated by the fact that the range 
for the base-pair interaction is very small: the hydrogen and 
stacking bonds are broken when the bases are kept apart at a 
fraction of an Angstrom, while the force work is appreciable 
on the distance of the opened bases (~ 1 nm). In contrast, to 
close the base pairs, one has to first work against the applied 
force; therefore, the closing rate rc depends on the force but not 
on the sequence. This physical origin of the rates is reported 
in the inset of figure 2. Note that, as room temperature is much 
smaller than the thermal denaturation temperature, we safely 
discard the existence of a denatured bubble in the zipped DNA 
portion. 

An example of unzipping dynamics for the A-phage 
sequence is shown in figure 3. The characteristic pauses in 
the unzipping, present in experiments and corresponding to 
deep local minima in the sequence landscape, are reproduced. 
The rates (21) lead to a master equation for the probability 
p„ (t) for the fork to be at site n at time t: 

—V— = - 2^ T„^„,p,„(t), (22) 

m=0 

where the matrix T„ ,„ is tridiagonal with nonzero entries 
T,„-i,,„ = -rdf), T„+i,,„ = -ro{m) and r,„,„, = r<,(m) + 
rdf). Given this transition matrix, the opening dynamics can 
be simulated with Monte Carlo dynamics. For small RNA or 
DNA molecules, the transition matrix T„_„, can be diagonalized 
numerically [31]. The smallest non-zero eigenvalue gives the 
switching time between a closed and open configuration for 
a hairpin with a free energy barrier such as that plotted in 
figure 2. 

2.3.3. Dynamics of the bead with equilibrated linkers and 
strands. In a typical experiment, the force is exerted on the 
molecule through the action of a (magnetic or optical) trap 
on the bead. While the external force on the bead can be 
considered as constant (e.g. in a magnetic trap), the force 



acting on the fork fluctuates unless the trap (and the molecular 
construction) is very soft; see equation (8). Therefore, the 
fixed-force model of the previous section has to be modified. 
In addition the bead, of size /? ~ 1 /xm, is a slow component 
whose dynamics need to be taken into account. Let us denote 
by k the stiffness of the trap and by y the friction of the 
bead in the solvent of viscosity r). Typical values for these 
quantities are = 0.1-0.5 pN nm~' and y = 67T Ri] = 
1.6710"^ pN s nm~'. Thus, the characteristic relaxation time 
of the bead is r = y /k 0.2-1 ms. 

The coupled dynamics of the fork and the bead was 
considered by Manosas et al [14] in the case of small RNA 
unzipping, with a single optical trap. For such small molecules 
the relaxation time of the unzipped strands is expected to be 
much smaller than the characteristic time of the bead, and the 
molecule can be considered at equilibrium. The dynamical 
scheme therefore consists in a coupled evolution equation for 
the location of the bead and of the fork. The bead position 
obeys a Langevin equation including the external force and the 
force exerted by the fork through the (equilibrated) linkers and 
unzipped strands, while the fork moves with rates (21) with a 
bead location-dependent force. 

A main conclusion of [14] is that, in the absence of 
feedback imposing a tixed force on the molecule, the trap 
stiffness must be as low as possible to detect jumps between 
closed and open configurations of the RNA molecule. We 
will discuss the validity of this statement in an information- 
theoretic setting in section 5.2. 

3. Dynamical modeling of the setup and its 
components 

The assumption that the linkers and the unzipped strands are 
at equilibrium as the unzipping proceeds is correct for short 
molecules as was the case in [14]. For long DNA molecules, 
the relaxation time of the unzipped strands may become large 
and dynamical modeling of the polymers involved in the 
molecular construction cannot be avoided. 

The purpose of this section is to describe how such a 
dynamical model can be implemented. We hereafter denote 
by 'setup' the full molecular construction that is used in a 
given experiment, including linkers, beads, etc, while the 
word 'molecule' refers to the part of DNA which has to be 
opened. In an idealized description, the state variable is a 
vector X = (;ci, . . . , Xp) whose elements are the distances 
from a reference position (that can be either the center of 
an optical trap or a fixed 'wall' to which the polymers are 
attached) of the extremities of the polymeric components in 
the setup. In addition to x, the number of open base pairs n is 
needed to complete the description of the state of the setup. 

As discussed in section 2.1, the total free energy F(x, n) 
of a setup is the sum of different contributions coming from 
all the elements of the setup. A typical example is given in 
equation (9). 

Our aim is thus to construct a dynamical model that holds 
on intermediate time scales, t > 10~^ s, and 

(i) gives the correct equilibrium Gibbs measure Peq(x, n) = 
exp{-F(x, n)l(k^T)), 
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(ii) reproduces the relaxation times for the different elements 
of the setup, as discussed below, 

(iii) gives reasonable dynamical correlations between different 
elements of the setup. 

It is worth stressing at this point that ours is a coarse- 
grained model which does not take into account the motion of 
the individual monomers. It is expected that the dynamics on 
time scales smaller than the typical sojourn time of the fork on 
a base (> 10~^ s) is not relevant to our study of unzipping. 

3.1. Langevin dynamics for the polymers and the beads 

First, we consider the dynamics of x at fixed n. In appendix A, 
we show that for long enough times the dynamics of the setup 
can be described by a system of coupled Langevin equations: 



where i, j = I, . . . , p, and 



dF 
dxi 



+ 1i 



(23) 



Fj+ij = Yi"l\^ that get a contribution from the 
polymer joining jc; and Xi+i. 



For instance, the setups in figure 1 correspond to the 
matrices: 



ss 3 



6 



6 



3 

v' 



3/5 





'ss 6 



6 



/K + Kdsf 



y<ii 6 



V 



J'ds 6 






0\ 



(25) 



the free energy F(x) is the sum of a contribution coming 
from each element of the setup: 

(i) each optical trap contributes }jkAx^, where Ax is its 
elongation; 

(ii) a bead in position / subjected to a constant force gives 
a contribution —/a-, ; 

(iii) a polymer gives a contribution IV, (Ax, A',), with Ax 
being its elongation and A', its number of monomers. 

For example, the total free energies of the setups in 
figure 1 are 

Fa(x) = \kix\ + lVds(A2 - Xi, Afds) + Wss(X3 - X2, N^,) 

+ W,,{X4- Xi,N,^) + \k2(.x^- Xf, 
Fb(x) = Wds(xi, TVds) + W,Ax2 - xi, N,,) 

+ W,,(X3-X2, N,,) - /X3. 

^ is a Gaussian white noise with zero average and 
variance (r?,(f)';>(0)> = ^k^TFjjSit), as requested by 
the fluctuation-dissipation relation, 
the matrix F is a tridiagonal matrix such that 

(i) the diagonal element F,, is the sum of three 
contributions: 

(a) a term Y"-\Ni-\l'i + y"'Ni/3 coming from the 
adjacent polymers (if any); 

(b) a term y coming from the bead (if any) attached 
to X, ; 

(c) a term taking into account the viscosity of the 
Nc base pairs of the DNA molecule attached 
to the fork (X3 and xj in figures 1(A) and (5) 
respectively) that are not open; this term has the 
Fleury form ymo\ = y Nc and has to be added 
to the diagonal element of F corresponding to the 
fork position; 

(ii) the offdiagonal elements are zero, except F,- ,+1 = 



A detailed derivation of these results and in particular of 
the form of the matrix F can be found in appendix A. 

3.2. Fork dynamics 

The Langevin equation for the polymer dynamics at fixed n 
must be complemented with transition rates for the dynamics 
of n. To this aim, we discretize the Langevin equation with 
time step At, and at each time step we allow the opening 
« ^ n + 1 or closing « ^ w — 1 of a base pair at most. 

The dynamics takes the form of a discrete time Markov 
chain, with transitions (x, n) — >• (x', n') and n' e {n,n ± 1). 
The total free energy F(x, n) = Fsaup(x, n) + G(n; B), where 
the first contribution has been discussed in the previous section 
and G(n; B) is the pairing free energy of the molecule, as 
discussed in section 2.1.2. In appendix B, we show that in 
order to satisfy the detailed balance condition with respect 
to Peq(x, n) = exp(— F(x, n)/(k^T)), one should perform a 
single step following the procedure. 

(i) Choose whether to stay (n' = «), to open (n' = n + 1) 
or to close (n' = n — 1) a base, with rates r^'° ''(x, n) 
respectively: 



r''{x,n) 
r'^(x, n) 
r^(x, n) ■■ 



g/if(-?,«)-/3F(J,«-l)^ 

1 — r"(x, n) — r^(x, n). 



(26) 



(ii) If the choice was to open,^rif perform a discrete Langevin 
step X — ^ x' at fixed n and then increase n by one. 

(iii) If the choice was to close, first decrease n by one and 
then perform a discrete Langevin step x — >• x' at fixed 
n' = n — \. 

(iv) If the choice was to stay, just perform a discrete Langevin 
step X — >• x' at fixed n. 

The Langevin equation is discretized in a standard way 
by integrating equation (23) over a time At: 



Xi(t + At) = Xi(At) + F,..' 



9F(x) 
9xi 



At + E 



(27) 



where Ej = fi^'rij{t)dt are Gaussian variables with zero 
average and variance 

(EiEj) = 2kTiTr,jAt (28) 

that are independently drawn at each discrete time step. 
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3.3. Free energy at finite n 

In section 2.1, we discussed some models for the free energy 
(x , n) of a polymer with n monomers and extension x. In the 
limit X , n ^ cx) at fixed extension per monomer, I = x/n, the 
free energy enjoys an extensivity property: W(x, n) = nw(l). 
However, in our simulations we might be interested in regimes 
where n is small, typically of the order of 10^0 for small RNA 
molecules. In this case, knowledge of the free energy per 
monomer, w, is not sufficient, and a more detailed expression 
for W is necessary to avoid inconsistencies. 

As a starting point of the analysis, we consider a polymer 
made of A'^ identical monomers whose endpoints are denoted 
hy II i, i = I, . . . , N with = 0. The Hamiltonian of the 
chain is the sum of pairwise interactions (p(ui — m,_i) and the 
free energy reads, for x = Uf/, a& 



-pW{x,n) 



I 



, duN-\ e' 



(29) 



where £q is a reference microscopic length scale. From the 
above relation, the Chapman-Kolmogorov equation follows: 



-pWU.n+m) 



-pW{y.n)-f>W(x-y,m) 



(30) 



We first consider for simplicity the Gaussian model, (p(x) 



W(x,n) 



2n 



■log 



Ink^Tn 



(31) 



In the limit of large polymers, one obtains the free energy of a 
monomer of extension I through 
1 

wil) = lim -W{x = In, n) = (p{l) (32) 
«->oo ri 

as expected and consistent with the discussion of section 2.1. 
The logarithmic term in (31) contributes neither to w nor to 
the Langevin equation for x. However it does contribute to 
the rate to close a base pair (see equation (26)) and should be 
taken into account in order to recover the correct rates. An 
example of the effect of this term is obtained by computing the 
equilibrium probability of n. Consider the (unrealistic) case 
of a homopolymer, G(n; B) = ngo, subject to a constant force 
and using a Gaussian model for the open part of the molecule; 
then 



Pcq(n) 



dx e 



-nPgo-pW(.x,2n)+pfx 



-nflgo+ff 



(33) 



Therefore Peq{n) is a pure exponential, while if the correction 
were neglected one would have obtained wrong behavior at 
small n. 

For a generic model of ^ (x ) , one cannot compute W(x,n). 
Still we found that for our purposes (n > 40), a consistent 
approximation is obtained by keeping only the first coiTection 
to the n ^ oo result, i.e. by defining 



-pW(x.n) 



-finw{x / n) 



Pk(x/n)£l 
Ijtn 



(34) 



where k(I) = w"(l). One can check that this expression 
satisfies equation (30) with coreections in the exponent of 



0(1), while the terms O (log n + log m ) are taken into account. 
Within this approximation, the error in log r^. (x , n) in equation 
(26) is 0(l/n^) while if the first corrections are neglected it 
is 0(l/n). 

In the following, we will make use of definition (34) unless 
otherwise stated. We will discuss an example where the effects 
of neglecting the corrections are clearly observable. 

3.4. Details of the numerical simulations 

We performed numerical simulations of the molecular 
constructions depicted in figure 1, with the following 
specifications. 

• The total free energies of the two setups are given by 
equation (24) plus the term G(w; B). 

• The free energy of each polymer includes the saddle-point 
corrections, i.e. it is given by equation (34). The relation 
/(/) (see section 2.1) is numerically inverted to obtain 
w{l) and k(l) that enter in equation (34). 

• For the single-stranded DNA we used the MFJC model, 
equation (4), with d = 0.56 nm, b = 1.4 nm and 
y,, = 800 pN. 

• For the double- stranded DNA we used the WLC model 
in equation (5), with a small regularization term to avoid 
a divergence for / ^ 0, which is however irrelevant for 
values of forces to be discussed in the following, and with 
A = 48 nm, L = 0.34 nm and y^, = 1000 pN. 

• Unless otherwise stated, the double-stranded DNA linker 
is made of A'ds = 3120 bps, while the two single-stranded 
linkers are made of A^ss = 40H-n bases each, where n is the 
number of open DNA bases (in other words, we included 
on each side a 40-base single-stranded linker). 

• We worked at fixed temperature k^T = 4 pN nm, 
corresponding to T = 16.7 °C. 

• We used the dynamical equations for the polymers defined 
above, equations (23), within the discrete procedure 
illustrated in section 3.2 and with transition rates (26) 
for the fork with the attempt rate r = 10^ Hz. 

• The matrices F corresponding to the setups in figure 1 
are given in equation (25); we used y^j" = = y' = 
2 X 10"** pN s nm~'. We used a value y = 1.67 x 
10~^ pN s nm~' for the viscosity of the beads. 

• The time step was fixed at At = 10~** s; this value ensures 
a correct integration of the equation of motion in all the 
regimes discussed below. Even if in some cases a larger 
integration step could be used, we decided to keep it fixed 
in order to be sure that discretization biases are not present. 

The values of the spring constants ki and ^2 and of the 
force / in equation (24) varied in different simulation runs, 
and will be specified later. 

The program we used for the numerical simulations can 
be downloaded from http://www.lpt.ens.fr/ zamponi. A user- 
friendly version will be made available as soon as possible. 
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3.5. Limits of validity of the dynamical model 

Our model of the polymer dynamics suffers from two main 
limitations. 

First, we keep only one collective coordinate for each 
polymer (its extension) associated with the longest relaxation 
mode. Faster modes are discarded. The approximation is 
justified provided there is no other mode slower than the typical 
sojourn time on a base pair. From the discussion of section 
2.3.1, the number of unzipped base pairs, «, cannot be well 
above a thousand. 

Another upper limit on n comes from the assumption that 
the force is uniform along the polymer. In principle the force 
is a function of the time / and the location y along the polymer, 
which obeys a diffusion equation with a microscopic diffusion 
coefficient Z)^ ~ (x;!")^/t™, where jc" is the length of a 
monomer and r™ = is its relaxation time. Assume 

that, at time 0, a base pair closes and the polymer is stretched 
at the extremity x = by .x"^ . Then the force, initially equal to 
f {x, r = 0) = fe™x^5(jc), will decay following the Gaussian 
diffusion kernel. At time t, the force density at the extremity is 
/(jc, t) = k"^lx'^J^2jiDfj. The relaxation is over when this 
force excess is of the same order of magnitude as the typical 
thermal fluctuations 8f calculated in (8), that is, for times 

"-ssV ss/ m ^ T in- 10 

t > n \ <' ~ 2 X 10 '"n ps. (35) 

When n ~ 1000, the corresponding relaxation time is of the 
order of the sojourn time on a base. 

In conclusion, our dynamical model is adapted to ssDNA 
polymers whose length ranges from a few hundred to a few 
thousand bases. Shorter polymers can be considered at 
equilibrium, while longer polymers cannot be modeled without 
taking into account the space dependence of forces. A simple 
way to tackle this difficulty consists in arbitrarily cutting long 
polymers into 1000-base long segments, each modeled as 
above. This procedure will be followed in section 5.1. 

4. Unzipping at fixed force 

4.1. Quasi-equilibrium unzipping 

Before turning to the more interesting case of out-of- 
equilibrium unzipping, we focus on the case of a small 
molecule which is subject to a constant force close to the 
critical force. In this situation, the molecule is able to visit all 
the possible configurations. 

We performed a set of numerical simulations at constant 
force / = 16.45 pN, with the setup described in figure 1(5). 
The DNA molecule is a uniform segment of N = 500 
base pairs, with pairing free energy G(n; B) = ngo and 
go = 2.69kBT. The entropic free energy per base of the 
two open single strands is 2gss(f) = 2.6Mk^T. Therefore, 
the infinite molecule would stay close; we are slightly below 
the critical force. To the right and left open portions of the 
molecule, two single-stranded DNA linkers of A'^ = 40 bases 
each are attached; therefore, the total length of the single- 
stranded linkers is A^ss = -I- n, where n is as usual the 




n 



Figure 4. Bottom: average fraction of the time spent on each base. 
The full (blue) curve corresponds to equation (34) while the dashed 
(black) curve corresponds to equation (34) without the saddle-point 
corrections (the square-root term). The dot-dashed (red) line is 
P^qin) oc exp[— fjAg] with Ag = 0.006. Top: effective rates 
(squares and triangles) estimated from the maximization of the 
probability in equation (36) (r = 10' Hz) without saddle-point 
corrections (full curve of the lower panel). The dashed lines are the 
asymptotic values of the rates; see text. We do not report the rates 
corresponding to the full equation (34) since they are essentially 
independent of n. 

number of open base pairs. The leftmost linker is a double- 
stranded DNA of A^ds = 3120 base pairs, whose presence is 
however irrelevant for the scope of this section. The total 
length of the simulation was T = 7200 s, i.e. 2 h. 

4.1.1. A test of the model. The average fraction of 
time spent on each base, corresponding to the equilibrium 
probability distribution Peq(n), is reported in the lower panel 
of figure 4. We expect that in the large n limit, Peq(n) ~ 
exp[-n(go - 2gss(/))] = exp[-«Ag], with Ag ~ 0.006. 
This is expected to break down when A'ss is so small that the 
second-order corrections to the saddle-point in equation (34) 
become important. As can be seen in figure 4, the exponential 
form correctly describes the data. 

We performed additional simulations in which the square- 
root term in equation (34) was removed. As one can see, in this 
case the small n deviations are much more pronounced. It is 
worth noting that for a non-Gaussian polymer, one expects 
a deviation from the exponential form at small enough n. 
However, this analysis shows that taking into account the small 
n corrections to W(x,n) systematically reduces this effect. 
Estimating its real order of magnitude therefore requires an 
exact expression for W(x,n), which could be in principle 
obtained from the recurrence equation (30). However, this is a 
complicated numerical task that goes beyond the scope of this 
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paper. What we want to stress here is that the inclusion of the 
square -root term in equation (34) gives significant differences 
when n < 200 and should therefore be included if one wants 
to analyze the unzipping of small molecules. 

4.1.2. Effective dynamics of the fork. In a situation where the 
linkers are short, such that their relaxation time is faster than 
the mean time spent on a base, the linkers are able to reach 
equilibrium before n changes. Therefore one might hope to 
define an effective dynamics for the fork, where n changes 
according to effective rates that depend on the variation in the 
free energy of the setup on closing or opening a base. 

To this aim we considered the model for the fork dynamics 
described in section 2.3, but assuming n-dependent opening 
and closing rates. Within this model, the probability of a 
trajectory of the fork is a function of the number of upward 
(M„)/downward (d„) jumps and the time spent on base n, t„: 

N 

Pefdnit)] = Y[ {rf(n)AtY"{rf(n)Aty" 

n=l 

x(l-Af(rf(n) + rf («)))'". (36) 

Given the values of u„, d„, t„ measured along our trajectory of 
duration T, we can infer the effective rates by maximizing the 
above probability. Assuming that r'^^At <^ 1, we obtain 

rf{n)=^, rf(n)='^, (37) 

tn tu 

as estimates for the effective rates. For the full expression 
(34), the rates are almost independent of n; on the other 
hand, if the first-order correction is neglected, one obtains 
«-dependent rates, consistent with the observation that Peq(n) 
is not exponential. These are reported in the upper panel of 
figure 4. In both cases, the rates are consistent with the detailed 
balance condition r^^{n)Peq{n) = rf^{n — l)Peq(w — !)• 

4.2. Out-of-equilibrium opening 

For long molecules, the barrier between the closed and open 
states may become very large, e.g. ~ 3000 k^T for the 50000 
bases >.-DNA at the critical force fc = 15.5 pN [31]. The 
time necessary to cross this barrier is huge, and full opening 
of the molecule never happens during experiments. To open a 
finite fraction of the molecule, the force has to be chosen to be 
larger than its critical value. The opening can then be modeled 
as a transient random walk, characterized by pauses at local 
minima of the free energy and rapid jumps in between [16]. 

4.2.7. Analytical calculation of the average time spent by the 
fork on a base. First consider the case of a fixed force acting 
on the fork while all the other components are at equilibrium 
as in section 2.3. In the transient random walk, the opening 
fork spends a finite time around a position n before escaping 
away and never coming back again in n. The number u„ of 
opening transitions w ^ « -H 1 is stochastic and varies from 
experiment to experiment and base to base. The total number 
of times the fork visits the base pair « before escaping is given 
by the sum of the number u„ of transitions from « — 1 to « 



and of the number u„+\ — 1 of transitions from « H- 1 to «. 
Therefore, the average time spent in n is 

(m„) + (m„+i) - 1 

'n = rr . (Jo) 

ro + rdn) 

where \/(ro + rc(n)) is the average time spent in n before each 
opening or closing step. Let us introduce the probability ii"^; 
of never reaching back position n starting from position n + \. 
The probability P of the number u„ of opening transitions 
n —>■ n + I during a single unzipping simply reads as 

p(«„) = (i-£;;,,)"""'£„".i- (39) 

From equation (39), we have that the average number of 
openings of bp n is 

(U„) = 2^ P(Un)u„ = — . (40) 

We are thus left with the calculation of £""+1. For infinite 
force, = 1 since the fork never moves backward. For 
finite force, we write a recursive equation for the probability 
Zi," that the fork never comes back to base n starting from base 
m(^ n + 1): 

£;; = ?™£^i + (i-?„.)£Lp (41) 

where 

^" ~ e&(/)H-eSofe.A».i) ^'^^^ 
is the probability of closing base n and 1 — q„ is the probability 
of opening it at each step. Note that for forces larger than the 
critical force, we have q„ < i : the random walk is submitted 
to a forward drift and is transient. The boundary conditions 
for equation (41) are = and = 1 for m — >• oo. 

For a homogeneous sequence, the escape probability is 
E = (I — 2q)/(\ — q). For a heterogeneous sequence by 
defining p',", = -jpr-, we obtain the Riccati recursion relation: 

p:=0; pLi= for«>m. (43) 

1 - q,„+ip;i, 

Equation (43) can be solved numerically for a given sequence. 
Then, the escape probability starting from n -H 1 is 

E"„.,= n P""' (44) 

and the average time spent in the base « is then obtained from 
(40) and (38). 

4.2.2. Results from the dynamical model. To check 
whether these theoretical predictions are affected by dynamical 
fluctuations of the bead, linkers and unzipped strands, we 
have carried out simulations with the model of section 3. We 
have carried out 160 unzippings of the >.-phage sequence at 
a force of 17 pN for T = 100 s (physical time), with the 
same molecular construct of section 4.1 (A'ds = 3120 base 
pairs of dsDNA linkers on a side plus N'^^ = 40 bases of the 
ssDNA linker at each side of the DNA to be open). For such a 
construct, the equilibrium extension of the polymers for n open 
base pairs is INsJsn + N^Jds, where /ds = 0.3337 nm, l^s = 
0.4758 nm and A'ss = A^^^^ -i- n. The stiffness of the polymers 
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Figure 5. Top: average time spent by the fork on position n. Bottom: time spent by the whole setup at an extension between X3 and 
xj + Ax, with Ax = 0.5 nm. The black line in both figures represents the theoretical predictions from section 4.2.1. The red points are the 
results from the simulation. Standard deviations are represented by error bars in the top panels and by the thickness of the red curves in the 
bottom panels. 



is l//teff = A'ss/A:^ + NiJk'l with = 160.5 pN nm"' 
and fcj" = 1450 pN nm~'. The relaxation times of the 
polymers are of the order of 0.1 ms for about 400 unzipped 
bases and 1 ms for about 2500 open bases, and are larger than 
the characteristic times of about 2 x 10~* s needed to open a 
weak base and of about 10~^ s needed to open a strong base. 

We plot in figure 5 the average time spent by the fork at 
location n for two portions of the sequence, corresponding 
to about 400 and 2500 open base pairs. The agreement 
between the theoretical and numerical estimates of the times 
is excellent, meaning that the fluctuations of extensions of 
the polymers and the dynamics of the bead induce negligible 
changes on the rates of opening and closing, as seen close to 
the critical force in section 4.1. 

As experiments do not give direct acces to the time spent 
by the fork at location n, we show in figure 5 (bottom) the time 
t (X3) spent by the unzipped ssDNA between extensions and 
X3 + d;i:. These times are compared to their values assuming 
that the positions of the beads are randomly drawn from the 
equilibrium measure: 

/(X3) = ^f„P(X3|M), (45) 

n 

where t„ is calculated from (38) and P(xs\n)is calculated from 
an argument similar to that used in section 2.2.1 and can be 
written up to the quadratic order around the saddle point as 

p(;,3|„) = / PMIl ^-p'-¥^U,-NMf)-2NJ„(f»\ (46) 

V 2:7r 

The agreement is, again, excellent. 

Figure 5 and equation (45) show that tixj) gets 
contributions from the times spent by the fork on a set of bases 
whose number depends on the magnitude of the equilibrium 



fluctuations of the linkers. These equilibrium fluctuations 
increase with the length of ssDNA, e.g. ^^3 ~ 5 nm for 400 
unzipped base pairs and 5^3 ~ 1 2 nm for 2500 unzipped bases. 
Therefore, as the number n of unzipped base pairs increases, 
the characteristic curve of t (x^) gets more and more convoluted 
(compare left-bottom and right-bottom panels in 5). 

In figure 6 we compare the value of the ssDNA extension 
from one unzipping, x^, to its average value at equilibrium, 
jCj'', as a function of the number of unzipped base pairs n. 
The fluctuations in the extension are compatible with the 
equilibrium deviations. Again, no clear out-of-equilibrium 
effect is observed. The reason is that, even if the single strand 
is not relaxed in the opening time of a base, the fork goes back 
and forward around a given location before moving away. 
Therefore, the quantities we have measured are averaged on 
the number of times a base pair is opened and are close to their 
mean value even in a single unzipping. This can be deduced 
from figure 5 by comparing the total time spent on a base 
(points) with the time to open a base (dashed lines) 

5. Unzipping at fixed extremities 

5.1. Correlation functions 

One of the main advantages of considering the dynamics of 
the linkers and of the beads is that it allows us to compute 
autocorrelation functions and to explore the interaction 
between different parts of the setup, a task which would be 
impossible from a priori calculations. 

We have performed a few simulations with the setup 
shown in figure 1(A) where the spring constant of the first 
optical trap of extension xi is 0.1 pN nm~' and the second 
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Figure 6. Total extension X3 of the setup in figure 1 (fi) at a fixed 
number n of unzipped bases for a single unzipping (black line). If 
the fork visits the same base n twice or more, we plot the average of 
the extension values. The gray strip represents the average value at 
equilibrium, x'^(n), and the standard deviation around its value at 
equilibrium. 

{x^) has stiffness 0.512 pN nm~' . The molecule in the fork is 
uniform with go = 2.69k^T . The only parameter that is varied 
across simulations is the distance between the optical traps and 
thus the typical number of open bases. In figure 7, we show 
two typical cases. What is evident is that the single strand 
has two time scales: one which is proper to the fluctuations at 
n fixed and another which is of the same order of magnitude 
as the correlation time of the fork. As the number of open 
bases grows, the fast time scale also grows until it becomes 
impossible to distinguish the two. 

As remarked in section 3.5, our model cannot in principle 
be used when the linkers are made of n > 1000 monomers. 
To check for the importance of force propagation effects, 
we ran a simulation for A'ss = 9700 (bottom panel of 
figure 7) where we cut each linker into nine subunits of 1000 
bases each plus a final unit which is connected to the opening 
fork. Overall, the correlation functions are not much affected 
by this modification and in particular the correlation times are 
unaffected within numerical errors. The main effect of cutting 
the long linkers is that the correlation function of the linker 
becomes more stretched (i.e. if they are fit with exp[— (t /r)^' ], 
the exponent fis is slightly smaller). This is to be expected since 
by cutting the polymer we include more relaxation modes, each 
with its relaxation time. A wider distribution of relaxation 
times implies a smaller exponent jSj. In table 5, we compare 
the results of the numerical simulation with the predictions of 
section 2.2.1 which do not take into account the interactions 
between different parts of the setup. While the simulated 
results for the single-stranded and the double-stranded DNA 
are not too far off from the prediction, the two springs show 
a much greater deviation from the theoretical estimates. This 
prompted us to analyze further the relationship between the 
fork and the bead position as will be discussed later. 

The potential acting on the fork position, in the case of 
a uniform molecule, is dictated by the stiffness of the rest of 
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Figure 7. Correlation functions for the setup in figure 1(A) at two 
different values of the number of open bases, A'ss = 40 + «. 

Table 5. Comparison between the correlation times of the setup in 
figure 1(A) as computed for an isolated element and the result of a 
complete numerical simulation. In the case of the fork, we reported 
as the theoretical value l/^eff> which must be multiplied by a 
viscosity to obtain the relaxation time; it turns out that a viscosity 
~ 8 X 10"'' pN s nm~' matches the theoretical and numerical results. 

Theoretical (s) Numerical (s) 

Single strand 4.83 x lO""//^^, 5.4 x IQ'^^Nl 

Double strand 4.96 x lO""^ ~3 x 10"' 

Spring xi 1.67 x lO"* ~1.5 x 10"^ 

Spring X4 3.26 x 10"^ ~7 x 10"' 

Fork A'ss oc 14.2 + 0.013Wss 1.3 x lO'^ + 8.4 x 10-^«ss 



the setup only as seen in section 2.2.1. That is to say that 
n experiences a harmonic potential with the spring constant 
proportional to kf-a; this in turn predicts correlation times that 
are proportional to ^ which has a linear dependence on n. 
This behavior is in very good agreement with the data that 
have been extracted from numerical simulations. 



5.2. Mutual information between the bead position and fork 
location 

Figure 9 shows the dynamical correlations of the fork and bead 
positions. The two beads have different correlation functions 
due to the difference in their stiffnesses: k = 0.5 pN nm~' for 
bead 1 and k = 0.1 pN nm~' for bead 2. After an initial decay 
(taking place over a time proportional to 1/fc from section 
2.3.3), the bead correlations exhibit a quasi-plateau behavior 
whose height is roughly proportional to l/k. The plateau 
reflects the correlation between the motion of the bead and 
that of the fork on time scales of the order of the equilibration 
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Figure 8. Relaxation times of the correlation functions in figure 7 as 
a function of the number of open bases. In the case of the single 
strand (ss), only the fast relaxation time is plotted. For the fork and 
the single strand, dashed lines indicate a fit to t„ = A + BN^s (with 
A = 1.3 X 10"-^ and S = 8.4 x 10"') and r,, = CN^ (with 
C = 5.4 X 10~" s). For the others, full lines are guides to the eye. 

time of the fork. It appears that soft beads allow one to track 
the location of the fork better than stiffer beads. 

In the following, we will give a closer look at the 
dependence of these correlations on the optical trap stiffness; 
to do so we construct a setup as in figure 1(A), but where 
the stiffness of the optical trap on the left is kept constant at 
0.512 pN nm~' while the stiffness of that on the right is varied 
across two orders of magnitude^^. 

To give quantitative support to this statement we define 
the mutual information / between the position of the bead in 
the optical trap, X4, and the number of open base pairs, n: 

/(X4, w) = V /" (bc4P(x4, n) log ( ) (47) 

where P(x^, n) is the joint probability density for the bead 
to be at position while there are n open base pairs; P(n) 
and P{x^) are the two marginals. Note that the definition of 
mutual information does not suffer from the problems which 
arise with entropy when we switch between a continuous and a 
discrete definition; that is to say that binning with sufficiently 
small bins does not change the mutual information. 

/ can be easily computed by keeping track of the times 
passed at a given bead position and the given number of 
open bases during a run of the simulation. As stressed 
before, the fact that the x^ coordinate must be binned has 
negligible effects on the computation of entropy. For very 
large stiffnesses the amplitude of the oscillations of the bead 
can become very small, and thus a lack of sensitivity in the 
measure of the position of the bead could become an issue. 
Fortunately, the current state of the art in the optical trap 
cannot attain stiffnesses larger than, say, 1 pN nm~' with 

^ An attentive reader might have noted that we changed the stiffness of the 
right bead compared to what it was in the previous section; the rationale 
behind this choice is to keep its value at the center of the range in which we 
will vary the other. 
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Figure 9. Top: autocon'elation functions for the setup in figure 1(A) 
when the molecule to unzip is a block copolymer composed of 
alternating stretches of ten strong pairs and ten weak pairs. This 
way the fork correlation time is greatly increased allowing us to 
view effects on the two traps of different optical stiffnesses. Bottom: 
correlation functions between one of the two beads and the number 
of open base pairs. Values have been normalized so that the value at 
zero time difference is p = {xtn) / y/ (x}) {n^) . 

micrometer beads [32]. In this regime, the fluctuations of the 
bead are dominated by the stiffness of the trap and thus we 
can say that {Sxf) ~ (pk2)~^; see equation (19). Comparing 
the fluctuations of the bead position with the sub-nanometer 
precision A over its location yields 

^ ^ ~ 10-50, (48) 

which is much larger than unity. 

Figure 1 shows that the mutual information / only weakly 
depends on the sequence but strongly depends on the stiffness 
k of the trap. This behavior can be understood very intuitively. 
Right after a base pair opens or closes, the whole setup in 
a fixed-force experiment has to give way; the less rigid an 
element of the setup is compared to the rest, the more it will 
accommodate for the change in n. 

We conclude that, in a single measurement, soft traps 
give more information on the fork location than stiff traps. 
However, I is the mutual information between the fork and 
bead locations per measure. As we have seen in section 5.1, 
the correlation times extracted from the simulations decrease 
with k and, as k grows, more and more uncoreelated measures 
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Figure 10. Mutual information / between X4 and n as a function of 
tlie trap stiffness, k. Black circles are computed on a uniform 
sequence, while red squares are measured on the sawtooth potential 
described in the caption to figure 9. 

can be done in the same amount of time. It is thus expected 
that information per unit of time is not maximal for small 
values of k. In other words, stiffer traps give worse quality 
but more frequent signals on the location of the fork. Finding 
the optimal value of k would require a detailed analysis of the 
correlation times of the bead and of the fork. In particular, the 
size of the bead would affect the optimal value for k through 
the viscosity coefficient, but not the information per measure, 
/. However this dependence should not be crucial since the 
bead size cannot be much varied in experiments: it can be 
neither too small to exert a sufficient force nor too large due to 
the size of the physical setup. 

6. Conclusion 

This paper has been devoted to the presentation of a dynamical 
model for the different components of the setups used in the 
unzipping of single DNA molecules under a mechanical action. 
Compared to previous studies, our model does not assume 
a priori that the polymers in the molecular construction are at 
equilibrium but takes into account their relaxation dynamics. 
It is important to stress out that the dynamical description for 
the linkers and the unzipped part of DNA is coarse grained: the 
basic unity is the polymers themselves and not the monomers 
they are made of. 

As a consequence, each polymer is associated with a 
unique relaxation time. The assumption is justified as long 
as these times are comparable to the typical opening or 
closing time of a single base pair. Longer polymeric chains, 
e.g. ssDNA strands with a few thousand bases, need to 
be modeled in a more detailed way; more precisely, they 
should be divided into short enough segments along which 
the force can be considered as uniform on the time scales 
associated with the fork motion. Although in this paper 
we did not observe any important force propagation effect, 
these might be more important in strongly nonequilibrium 
situations such as opening at constant (high) velocity. We plan 
to simulate unzippings with such molecular constructions in 
the near future to understand how force propagation across the 



polymeric segments can affect the effective rates for closing 
base pairs in such situations. 

One of our results is that one has to be very careful with 
the expression of the free energies (entering the dynamical 
rates) for short polymers, be they linkers or ssDNA unzipped 
strands. Use of the free energy per monomer, obtained from 
force-extension measures on long molecules, as usually done 
in the literature, can lead to erroneous results. We have shown 
that finite-size corrections to the energetic contributions and 
the dynamical rates have to be taken into account. 

As a main advantage, the code we have developed is 
versatile: we can easily change setups, for example use a fixed- 
force or fixed-position ensemble, and change the number and 
types of linkers and of traps for the beads. We have found 
that, in fixed-force unzippings, the opening and closing rates 
for the fork are not affected by the force fluctuations coming 
from the polymeiic chains. For small linkers and a number 
of unzipped base pairs, indeed, force fluctuations are large 
but fast, and are averaged out on the characteristic opening- 
closing time of a base pair. For large linkers or a number 
of unzipped bases force fluctuations are slow but small, and 
therefore do not change the dynamic of the opening fork. 
We have also performed unzipping simulations at large forces 
where the opening dynamics is transient, and found that the 
average time spent by the unzipped strands at a given extension 
is accurately predicted from the time spent by the fork on a 
base convoluted by the equilibrium fluctuations of ssDNA. 
Moreover, the extension between the extremities at a fixed 
number of open base pairs in a single unzipping experiment 
is compatible with equilibrium fluctuations of ssDNA and 
linkers. The program could be easily adapted to unzipping 
at constant velocity, where non-equilibrium effects are likely 
to be more important. 

Our study suggests that one measure of the position of 
the bead in soft traps gives more information on the location 
of the fork than in the case of stiffer traps. This statement is 
however to be considered with caution. Beads in stiffer traps 
reach equilibrium on shorter time scales, and the overall rate of 
information per unit time could be higher in stiffer traps. While 
purely qualitative at this stage, such a statement is relevant to 
the study of the inverse problem of unzipping, that is, inferring 
the sequence of the DNA molecule from the unzipping signal. 
We hope that the present dynamical modeling will be useful 
to assess the rate at which information on the sequence could 
be acquired from mechanical single molecule experiments. 
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Appendix A. Langevin dynamics of coupled 
polymers 

One of the simplest models of polymer dynamics is that 
proposed by Rouse [44], where the polymer is described as 
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a chain of beads which are modeled as Brownian particles, 
linked by harmonic springs. 

While it is true that this model is very crude because 
it ignores hydrodynamic interactions and exclude volume 
effects, it has the huge advantage of being largely solvable. 
Therefore, we will now use it as the basis for a few 
considerations that will then be generalized to more realistic 
models. 

Our aim is to write a system of coupled equations for 
the time evolution of a certain number of marked points on a 
(hetero)polymer. One of these points will be for instance the 
location of the opening fork. In the case of a double DNA 
strand attached to a single strand, one point will mark the 
location where the two different polymers are attached (see 
the examples in figure 1). Note that if the marked points we 
focus on are far apart, only the slower modes of the system 
will be relevant, as the fast modes describe local relaxations of 
the chain. Therefore, in the following, we want to focus on a 
long wavelength/long time effective description of the chain. 

A.l. The dynamics of a single polymer 

A. 1.1. The model and its normal modes. As the simplest 
case we consider a polymer composed of identical springs, 
each with an identical link at one end. The first is connected to 
a wall that has infinite mass (or, better still in this framework, 
infinite viscosity) and on the last a force / is exerted. The 
Langevin equations describing such a polymer can be written 
as 

' Ymiii = -2k,„ui + k,„U2 + m 



YmUN — —k,„UM + ki„Ufj-i + f + rifj, 
where are white Gaussian noises of zero mean and variance: 
{r)i{t)r)jm = 2k^TS,jm. (A.2) 
Let us for the moment neglect the noise term. Then, defining 
T,„ = Yin/ km, we can formally rewrite these equations as 

T„M„ = —2u„ + + M„+i, V «, (A. 3) 

supplemented by the boundary conditions 

Mo = 0, un+1 = UN + f/k,„. (A.4) 
A standard way to find the normal modes of the above 
linear system is to search for solutions of the form K„(f) = 
u„(0)exp(—Xt/Ti„). One can easily show that the general 
solution satisfying the first boundary condition mq = has the 
form 

u„(t) cx sin(^M) exp(— /(^)f/r„,), 
X(^) = 2(l-cos(^)). 
The second boundary condition (A.4) requires that un+\ (t) — 
Uf^(t) = f/ki„ = const. Since we can always add the constant 
value to iif^+i(t), we can replace this boundary condition by 
Uf^+i(t) = Unit). This requires that sin(^ A') ~ &m{q(N + \)); 
theni^r = (it /2 + p7t)IN . The slowest mode then corresponds 
to q = jt/2/N, which for large A' gives a relaxation time 

4 



(A.5) 



T(A^) = r„,/HJt/2/N) 



(A.6) 



A. 1.2. Recurrence equations for a fixed end. We now want 
to write a system of coupled equations for a certain number 
of points on the polymer by integrating out ms we are not 
interested in. To begin, we focus on the end point u^. 

It is convenient to perform a Laplace transformation and 
write 



/> oo 

M„(0 = / dXM„(A)e"^'/^"' 
Jo 



Then equation (A.5) becomes, in Laplace space, 

(2 - X)Un{'k) = U„+i(X) + M„-l(A), 



(A.7) 



(A.8) 



with the same boundary conditions mo(A) = 0, and u m+i {X) — 
UffiX) = (f/k,„)S(X). For / / 0, the latter condition reduces 
to Ufi+i(X) = um(X.) as discussed above for the normal mode 
analysis. 

We introduce a function 



fn-l(>^) = U„.i{X)/Un{X). 

Substituting the latter relation in (A.8), we get 

{2 - X- i;„.x(X))un{X) = u„+i{k), 
from which we get a Riccati recurrence equation 
^q(X) = (due to Mo = 0), 



1 



(A.9) 



(A. 10) 



(A.11) 



(A.l) ?„(^) 



2-X-^„-i(X) 
This recurrence can be solved and the function f„ (X) computed 
for all n. 

Since we are interested in the large time limit, we can 
expand the function f„(>.) for small X; we obtain 

n(l 



n + l 6(1 +n) 



■2n) n(6+ 19n + 16w- +4w^) 2 



180(1 +w) 



+ 0(X^). (A.12) 

One obtains the effective equation for m Af by substituting the 
above expression in (A. 10) and setting n = N. Keeping only 
the linear term in X and the leading terms in N ^ 1 , we get 



A' 



UnW = UN+\iX). 



1 

77 "3 

Moving back to the time domain, we obtain 
1 



N 



un). 



(A.13) 



(A. 14) 



which is equivalent, using the boundary condition um+x—ui^ 
f/k,„, to 



YmN 



-Un 



I r 

-UN + f. 



(A. 15) 



which proves the validity of the scaling in equation (20). 



3 A' 

In this way, we got an effective equation for the endpoint of 
the polymer that is still a linear first-order differential equation 
and takes into account only the slowest mode of the chain. 

There is however an inconvenience: in fact a 
straightforward computation shows that the relaxation time 
obtained from equation (A.15) is r(A^) = r,„Af^/3 that differs 
by a factor ;r^/12 from the correct value given by equation 
(A.6). The origin of this discrepancy clearly lies in the fact that 
the expansion we made in equation (A.12) is not convergent 
at fixed X for n 00, as successive terms in the series are of 
order n^P-^XP. 
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Figure Al. Two joint polymers subjected to an external force /. Xi 
marks the endpoint of the first polymer made of N links whose 
endpoints are ui, U2, . ■ . , Un-\, mjv = JCi- The second polymer 
originates from X\ and is made of 2M — 1 links, whose endpoints 
are i;_(m-i), i'-(iM-2), ■ • • , Ui, . . . , v^-i, Vm-,,X2. 



bead viscosity, etc. If the monomers are identical, then we are 
just marking a point in the middle of a polymer. 

The effective equation for the endpoint of polymer U can 
be derived following the analysis of the previous section. We 
denote xi = Uf/ and we get 

y,ll jxiit) = -fxi(t) + k^,{v-(M-i){t) - xt(t)), (A.17) 

where the last term is the 'external' force that the polymer V 
exerts on U. 



Let us then go back to the computation of the normal 
modes of the system within this formalism. The second 
boundary condition Mftf+i (A) = um(^) implies ^f^(X) = 1. The 
normal modes are the solutions of this equation with respect 
to X. One can show from the exact expression of ^^(A) that 

-qcot(q) = '^iq). (A. 16) 



lim N[^N(q^/N^) - 1] 



The zeroes of this function are = jt/l + kit; therefore, 
the solutions of fAr(X) = 1 tend for large N to k = (jt/2 + 
P7t)^/N^, in agreement with the exact result of the previous 
section. An inspection of equations (A. 12) and (A. 16) shows 
that the small X expansion of is equivalent to performing 
a small q expansion of f (q) in order to find its first zero. This 
indeed yields ^(q) ~ —l+q^/3 that gives q = V3 for the 
first zero that gives back t(A') = r,„N^/3. 

Then one can check that a higher order expansion in X (or 
equivalently in q) produces a more accurate result; indeed the 
series of f (q) converges for < :7r while the zero is located at 
q = jr/2. It is easy to show that if one truncates the series to 
order p, the difference between the solution and the true zero 
is exponentially small in p. 

A. 1.3. Discussion The conclusion of this section is that 
equation (A. 15) is a correct description of the dynamics of the 
end of the polymer in the limit of large and large times. 
While it captures the correct scaling with N of the relaxation 
time, the coefficient is wrong by a factor of jr^/12 ~ 0.82. 
Still, this is quite satisfactory for our purposes since the 
experimental error in the determination of t,„ is of the same 
order of magnitude. Better approximations can be obtained 
by truncating the expansion of to higher orders in X, 

therefore obtaining a higher order differential equation for 

In the following, we will derive the coupled equation for 
many marked points along the chain, limiting ourselves to the 
first-order truncation. This produces first-order differential 
equations of the Langevin type. 

A.2. Dynamics of two coupled polymers 

We will now show how to use this formalism to derive coupled 
equations for different points on a composite polymer. We 
continue neglecting the noise, which we will reintroduce at 
the end of this section. 

As a simple example, let us consider the polymer drawn 
in figure Al . It is composed of A' monomers of type 'U' linked 
to 2M — I monomers of type 'V. The two types of monomers 
might differ in the value of the microscopic spring constant. 



(A.18) 



A.2.1. Integration of the V polymer. Now we want to 
integrate out all the monomers • • • , vm-\ in order 

to obtain the coupling between xi and X2. To this aim, and in 
order to keep the formalism symmetric, we can start from the 
middle of the polymer V by integrating simultaneously d_i 
and vi in order to obtain effective equations for u_2 and V2, 
and so on. In Laplace space (note that now in equation (A. 7) 
= T^), the equations for v±\ have the form 

{2-X)v.i{X) = + 

{2-X)vi(X) = V2(k) + v_i(>-)- 
These can be easily solved to get i;±i as a function of v±2. 
Iteration leads to the following form for the equation after n 
steps: 

^n(X)V-n-liX) = V-n-2(X) + r]n(X)v„+l(X), 
^n(>-)V„+l{X.) = Vn+2(X) + ri„(X)V-„-l(k) . 

One can check that this form is stable under one step of iteration 
and the following recursion relations are obtained: 

\^o = 2-X, 
% = 1, 

I, 



(A.19) 



ri„+\ 



- X ■ 



hi 



(A.20) 



ril 



where the initial values are determined by consistency between 
(A.18) and (A.19) for n = 0. These recurrences are easily 
solved by introducing the two quantities A„ = l/(t„ — ))„) 
and B„ = l/(^„ + l^„) respectively; these satisfy the same 
recurrence in (A.l 1) except for the initial condition which is 
different and determined according to (A.20). 

At the leading order in « — >• 00 and at first order in A, we 

get 



1 2n 
1 + :rX, 



1 2n 

rjAX) =— + —X. (A.21) 
6 



2n 3 2n 
Finally, one obtains from this procedure a coupled equation 



for V- 



(M-l) 



and vm-i where xi = v-m and X2 = vm also 



appear. 



A.2.2. Coupled effective equations. To obtain the coupled 
effective equations, one starts from the following system: 

■--^Xi+kl(v_M+\-Xx), 

tM-2Q-)V-M+l(>~) = Xi+ r]M-2WVM-lW, (A.22) 
hM-2MVM-l(>-) =X2 + r]M-2{X)V-M+l(X), 

.(1 - X)X2 = Vm-i + f. 
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where the first equation is just the Laplace transform of 
equation (A. 17) (recall that we use the definition of Laplace 
transform (A. 7) with t„, = rj^), the second and third equations 
are equation (A. 19) for n = M — 2 and the last equation is the 
Laplace transform of the equation for X2, which in the time 
domain reads as y^xj = —k^^(x2 — vm-i) + f ■ 

Eliminating V-m+i and vm-i from these equations, using 
the recurrence equations (A. 20) and the result (A.2 1 ) we finally 
get the coupled equations: 



.A' 



V 
/ m 



,2M 



x\ + y,„ 



,2M 



-X2 



= ^JCi +^(X2-Xi), 

N 2M 
^2M y2M , kl 



{X2 - Xi) + f. 



At this point we reintroduce the free energy of the polymer 
chain, defining N\ = N and A'2 = 2M — 1 ~ 2M: 



FiXl,X2) 

and a matrix 



2A', 



k^ 

x\+^{X2-Xyf 
' 2N2 



Im 3 Im ^ 
' m 6 



Im 3 



SO that we can write the above system as 
3F 



9x/ 



+ /;■ + '?;, 



(A.24) 



(A.25) 



(A.26) 



where / = (0, /) is the external force vector and we 
reintroduced the noise term that we neglected before. 

The correlation function of the noise at this point is 
determined by the requirement that the fluctuation-dissipation 
relation is verified. This imposes that 

{m(t)r)jm = 2k^TY;jm- (A.27) 

A3. Beads 

At this point, we should add the beads that are used for the 
optical manipulation of polymers. These beads are optically 
tweezed or subjected to magnetic fields in order to apply forces 
to the polymers. In the former case, the force acting on the 
bead is a harmonic force / = —k{x — X), while in the latter 
it is constant, / = /ext- Each bead is characterized by a 
friction coefficient that can be computed using the Stokes 
law; we denote it by y . Typically they are of the order of 
10~' pN s nm~', i.e. much bigger than the microscopic 
viscosity of the polymers y„, ~ 10"** pN s nm~' . 

In the presence of a bead attached to the endpoint of a 
polymer, the equations of motion (A.l), (A. 18), etc, remain 
valid, but one should add the contribution of y to the viscosity 
of the coordinate describing the position of the bead. For 
instance, if there is a bead attached to the endpoint mat, the last 
equation of (A.l) reads as 



(A.28) 



Then the above derivation still holds because the last equation 
is not used until the end. The only modification will be the 



inclusion of y on the diagonal element F,, corresponding to 
the coordinate of the bead. 

Therefore to describe the beads attached to the end of the 
molecular construction in figure 1, we modify the matrix F 
as above, and in case A, we add to the free energy a term 
^k{x4 — X)^, while in case B we add a term —fc^Xi- 

In the case of figure 1(A), one also has to include the left 
bead. In this case, if we call V the first polymer after the bead, 
we can start from a system of equations identical to (A. 22), 
but with the first equation replaced by 



yxi 



-kxi +k,„(v.M+i - Xl). 



(A 23) This will again lead to (A.26) with 



r in (, 
im 3 



and 



F{XuX2) 



2N' 



■(X2 - Xl) 



(A.29) 



(A.30) 



(A.31) 



A.4. Description of a generic setup 

The arguments of the previous section suggest that in the 
general case, a bead can be treated 'as a particular instance 
of a polymer'. In other words, we can consider the setups in 
figure 1 as chains of p joint elements U = Ui, U2, • ■ • , Up', 
each element can be an 'optical trap' (i.e. a spring) or a polymer 
of Ni, N2, ■ ■ ■ , Np monomers respectively (in the case of an 
optical trap, we set by default A^, = 1)- The endpoint of each 
element is denoted by Xi, and x = (xi, X2, ■ ■ ■ , Xp) is the state 
vector of the system (we also define xq = 0). 

Then, the total free energy is F(x) = X!f=i Wui(xi—Xi-i) 
where Wij. (x) = \kx^ for an optical trap of stiffness k. Then 
equation (A.26) holds, with /, j running from 1 to p and the 
noise correlation matrix is given by (A.27). 

The matrix F must be constructed as follows. Each 
diagonal term F,, , related to x, , is the sum of a Stokes term 
coming from a bead possibly attached to x,- and the contribution 
coming from the two elements adjacent to x, (except for i = p 
when there is only one contribution): 



F;; 



(A.32) 



(the first term is present only if there is a bead attached to x, ). 
All the off-diagonal elements are zero except those adjacent 
to the diagonal (i.e. connecting x,- and x,±i) which get a 
contribution from the polymer connecting these two ends: 

F,,+i = F,+i,i = K„y'-'^, = l,...,p- 1. (A.33) 

6 

Note that this final formulation is independent of the Gaussian 
form of F(x) that we assumed in the derivation; therefore, 
we will also use it for non-Gaussian polymers substituting the 
appropriate form of F{x) in equation (A.26). 

To conclude this section, note that a further check of 
the quality of the first-order approximation can be done as 
follows. If we consider a single polymer made of A'l + A'2 
bases, the corresponding relaxation time is predicted to be 
T = r,„{Ni + A^2)^/3. On the other hand, we could 
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consider two coupled polymers of A'l and A'2 bases following 
equation (A.23) for yj^^-^ = y,„ and k^^-^ = k,„ respectively. 
The coupled equation can be exactly solved and yields two 
distinct relaxation times (that typically differ by a factor 
of 10); the slowest relaxation time can be compared with 
T = t,„{Ni + N2)^/3. We found that the difference is at most 
20%, and the error is maximal for Ni ~ A'2 while it decreases 
when one of the two polymers is much longer than the other. 

Appendix B. Transition rates for tlie forl^ dynamics 

We now consider a fork n attached to the polymers. For 
simplicity, we consider the case of a single polymer whose 
extension is x and free energy is W(x,n). We want to construct 
a stochastic process that samples the equilibrium distribution 
PeqC-t, n) = e-^"*'(^'")-G(«;B)/z, where -G(n; B) is the free 
energy gain in closing the first n bases of DNA, as defined in 
equation (6). 

The random process is constructed as follows. The 
Langevin equation discussed in the previous section is 
discretized with time step At. If at a given time t the system 
is in a state (x, «), we allow three possible transitions: 

• (x, n) ^ (x + Ax, n) with rate Wix, n, Ax), 

• (x, n) — >• (x + Ax, M + 1) with rate H"(x, n. Ax), 

• (x, w) — ^ (x + Ax, « — 1) with rate H^(x, n. Ax). 

We must have 



/ 



dAx//''(x, w, Ax) + H'\x, n, Ax) + H''(x, n, Ax) = 1. 

(B.l) 

Moreover we can define rates r''-°'^(x,n) = 
/ dAx//^'°'''(x, n. Ax) that represent the rates to stay, 
open or close n independent of Ax. In a practical 
implementation we first decide whether to open, close or stay 
according to r* " '', and then extract Ax from the distribution 
H'-"-'(x,n, Ax)lr''"-''{x,n). 

The detailed balance conditions read as 
P(n,x)H"(x,n, Ax) 

= P(n + 1, X + Ax)H^(n + 1, x + Ax, — Ax) 
P(n,x)H^(x,n, Ax) 

= P(m - 1, X + Ax)H''{n - 1, X + Ax, - Ax) ^^'^^ 
Pin,x)H'(x,n, Ax) 

= P(n, X + Ax)H'(n, x + Ax, — Ax). 

We assume that the rate for opening is given by the product 
of a term that only depends on the binding free energy as in 
equation (2 1 ) and a term corresponding to a standard Langevin 
step: 



H°{x, n. Ax) = rAt e 



G{n:B)-G{n+\:B) 



AnTAt 



X exp 



Yn 

'ATAt 



Ax 



f(x,n)At\ 



(B.3) 



Note that integrating over Ax we find r"(x, n) = 
^^j^G(„:B)-G(„+i:B) ^ Q-so{b,M.b,M)^ consistent with 

equation (21). 



Now it is easy to show that the following expression 
for H'^(x, n. Ax) follows from the second detailed balance 
condition: 



H'{x, n. Ax) = ^Af e^^'--"'-^'^*'^^^'^-"-" 

y„-i I /(x + Ax, n — 1) Af 



AnTAt 



y,i-\ 



X exp 



ATAt 



Ax + 



Yn-l 



(B.4) 



and that the first condition is then automatically satisfied. Up 
to now, we did not specify the form for /(x, n). However for 
a generic /(x, n), the above rate is not Gaussian. To obtain a 
Gaussian rate, we assume that 

dW(x,n) 



f(x, n) 



dx 



(B.5) 



and perform the following simplifications assuming that Af is 
small: 

//^(x, «, Ax) = rAf e^'^'-^-'"-^'^<--"-" 



^pW(x.n-l)-pW(x+Ax,n-l)-pf(n-l,x+A.x) 



AjtTAt 

Yn-l 



X exp 



Yn-l 



Ax 



ATAt 

^^f^pW(x,n)-tiW{x.n-\) 
Yn-l 



f(x + Ax, n — I) At 

Yn-l 



AnTAt 

Yn-l 



X exp 



ATAt 



f(x + Ax,n- l)AtY 
Ax I 



Yn-l 



P d^W(x,n- 1) 
2 3x2 



Ax^ 



(B.6) 



Neglecting O(Ax^) one obtains a Gaussian distribution for 
Ax, and computing the first and second moments of the 
Gaussian one can see that at the lowest order in At it is 
equivalent to 



//' (x, «, Ax) = rAf e 



pW(x.n)-pW{x,n-l) 



AitTAt 



X exp 



Yn-l 

'ATAt 



Yn-l 
2" 



Ax ■ 



/(x,»- 1)AA 



Yn-\ 



(B.7) 



From the above expression, we deduce that the rate for closing 
is r^{x,n) = r Af e^'*'*-* "'"^"'''^'""'', and one_^rif has to close 
and then perform a Langevin step with force f(x,n — 1) and 
friction y„-|. 

Finally, the rate at constant n is simply given by 



H\x, n, Ax) = [1 - r°(x, n) - r'(x, n)] 



AnTAt 



Yn 



X exp 



Yn 



ATAt 



Ax-l^^^\ 

Yn ) 



(B.8) 



and it is easy to see that this verifies the third detailed balance 
equation if equation (B.5) holds and higher orders in Af are 
neglected. 

To resume, the implementation of the algorithm is as 
follows. 
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(1) Choose whether to stay, open or close, with rates 
r^'°''^{x, n) respectively. 

(2) If o^&n, first perform a Langevin step at n and then increase 
n by 1. 

(3) If close, ^ra? decrease n by 1 and then perform a Langevin 
step at w — 1 . 

(4) If stay, just perform a Langevin step at n. 

(5) Goto 1. 

The extension of the above derivation to a case where 
many polymers are present is straightforward, since the only 
polymers whose rates are coupled with n are the two adjacent 
ones. All the other polymers are not influenced by n, and one 
can use standard discretized Langevin dynamics. 
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Introduction. — Reaching a target with limited infor- 
mation is a fundamental task for living organisms. Small 
organisms, such as bacteria and eukaryotic cells, are 
thought to estimate and ascend the gradient of nutri- 
ent concentration, a process called chemotaxis [1,2]. At 
the scale of a larger organism the Reynolds number 
is higher [3]: most biologically relevant chemical fields 
become turbulent and dilute. As a result, the trajectories 
of insects following odor traces appear much more complex 
than those of smaller organisms [4] . The modeling of search 
processes in the presence of noisy information is important 
not only for biology, but also for robotics [5,6]. 

Assume that the search has proceeded for some time, 
and the searcher has received some hits, i.e. has detected 
some molecule of odor sent by the target, along the 
trajectory. How should the searcher move next? The 
timing and locations of the hits, as well as the absence 
of hits along the remaining parts of the trajectory 
all provide useful information about the location y of 
the target. In Bayesian terms, this defines a posterior 
probability Pt{y) for the position of the target. Going 
towards the maximum of Pt is not an optimal strategy 
as the maximum does generally not coincide with the 
target, especially in the initial stage of the search process. 
Recently, Vergassola, Villermaux and Shraiman proposed 
an alternative strategy, called Infotaxis [7]. The searcher 
moves to maximize the (expected) gain in information 

t^) E-mail: barbieriQlps.ens.fr 



about the location of the target, that is, the loss in the 
entropy of the distribution Pt{y). As the search goes 
on, the entropy typically decreases, until the source 
is finally located. The strategy naturally balances the 
needs for exploration (harvesting more information about 
the target location) and exploitation (going towards 
the maximum of Pf). Infotaxis was implemented and 
tested on two-dimensional square lattices^: the target was 
almost always found, and the distribution of search time 
appeared to decay exponentially. 

Yet some important questions about Infotaxis remain 
open. First, how well does the algorithm perform in three 
dimensions? In addition to its practical interest, this 
question arises naturally in the context of the Brownian 
motion theory. Purely random walks are space-filling 
in two dimensions, and transient in higher-dimensional 
spaces. Finding a target in three dimensions is therefore 
much harder, and constitutes a real test for the capa- 
bilities of Infotaxis. Secondly, how dependent on the 
underlying lattice are the results reported in [7]? Realistic 
descriptions of animal behavior or implementations in 
biomimetic robots require us to consider continuous 
spaces. In addition the presence of a lattice introduces 
anisotropics, while the odor propagation model used 
in [7] was isotropic. Thirdly, two-dimensional trajectories 
seem to exhibit spiral-like shapes. How precisely can we 
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characterize those spirals, and what are their counterparts 
in three dimensions? 

In this letter, we derive the equation of motion for the 
Infotaxis searcher in the continuous space. We then intro- 
duce an algorithm to solve this equation^. The perfor- 
mances of Infotaxis, i.e. the probability of success and the 
distribution of the search times are studied in D = 2 and 
3 dimensions. The spiral- and coil-like shape of the search 
trajectories for, respectively, D = 2 and 3, and the pinning 
of the trajectory around the received hits are investigated 
analytically and numerically. We show that the motion of 
a searcher receiving an average and deterministic signal is 
a good predictor of the typical properties of the motion in 
the presence of stochastic hits. Finally we discuss a possi- 
ble extension to a non-greedy search strategy, which could 
help reduce pinning effects. 

Equation of motion for the searcher. — A point- 
source in y* emits particles, which diffuse in space, 
and have a finite lifetime. In the stationary regime, the 
probability per unit of time to encounter a particle in x is 
denoted by R{y* — x) (for supplementary information see 
ref. [7]). Function R has an integrable divergence at the 
origin {R{u) ~ — logu in D = 2, ~ ^ in D = 3 dimensions, 
when M— >0), and exponentially decreasing tails for large 
distances u. In the following distances are measured in 
the unit of the decay length of R. The unit of time is 
the inverse of the rate of emission of particles by the 
source, divided by the (dimensionless) linear size, a, of the 
searcher for D = 3, or multiplied by log(l/a) for Z) = 2 [7]. 

Let x(t) be the position of the searcher at time t. We 
denote by Nh the number of particles detected (called 
hits) at earlier times, Q^ti^t, with i = 1, . . . , Nh- Based 
on those hits, the searcher can draw a probabilistic 
map over the possible locations y of the source. In the 
Bayesian framework, the posterior probability density 
Pt(y) for the location of the source is the (normalized) 
product of the probabilities of having detected the Nh 
particles at locations x(ti), times the probability of not 
having detected any particle at other locations along the 
trajectory, times the prior probability density Pg over y, 

Nh 

Pt{y) oc n ^(y - x(i.)) e- ^0 Po(y). (1) 

i=l 

Hence, Pt diverges where the hits have been received and 
vanishes in the other places along the trajectory. In the 
following we will use brackets to denote averages over this 
posterior distribution: 

(/(y))y;t = /dyPt(y)/(y). (2) 

Assume now that the searcher stays in x(t) during an 
infinitesimal time 5t. The number of hits, n, received 
during this time interval is a stochastic variable equal to 

■^The code is publicly available from http://www.lps.ens.fr/ 
" barbieri. 



zero or one, with probabilities p(0|y) = 1 — StR{y — x) and 
p(l|y) =dtR{y — x.), depending on the location y of the 
source. In the language of information theory, the particle 
emission and detection system can be thought as a noisy 
channel, and n is the output message associated to the 
input codeword y. The mutual information 51 between n 
and y is 

(3) 

up to 0{St^), where 

is the entropy rate of the posterior distribution Pj. Info- 
taxis stipulates that SI should be maximized, or, equiva- 
lently that Vt should be minimized, i.e. made as negative 
as possible. In other words, interpreting Vt as a potential, 
the searcher should descend the gradient of Vt. A natural 
equation of motion is then 

7(t)i(<) = -VxI4(x(t)), (5) 

where j{t) plays the role of a friction coefficient. A possible 
choice is 7(i) = | VxVt(x(t))|/t;o, to keep the modulus of 
the velocity fixed and equal to vq. This is close to the 
lattice version of [7], where the searcher could either stay 
immobile or move by one lattice site, and the velocity 
could take only one non-zero value. In general, 7(f) can be 
any positive function of the time. In the following, we will 
first consider the case j(t) = 7 const. We will later discuss 
to what extent the trajectories and the performances 
change when 7 varies with time. 

Numerical integration. — The equation of motion (5) 
is highly non-linear and depends on the whole history of 
the search process through the posterior probability (1). 
To solve eq. (5) numerically we discretize the time with 
a step At. The positions x^ of the searcher at the 
instants t( = £ At, where ^ is a positive integer, are 
memorized, as well as the occurrences (times) of the hits. 
The amplitude of the ^-th elementary move is estimated 
from (5): x^+i — Xf = ~^^^7) ^xVt(x^). The calculation of 
the gradient of the potential requires to estimate averages 
over the space y with measure Pt (2). To do so, we 
use the importance sampling Monte Carlo method [9]. 
The term inside the bracket in (4), and its gradient 
with respect to x are exponentially decreasing functions 
of the radial distance u= |x£ — y|. We thus perform the 
change of variable u = uo{l — v) / v , where v €]0; 1], and uq 
is a scale parameter. We draw Nmc values of the new 
variable, Va, a =1,..., Nmc 7 uniformly and at random, 
and calculate the corresponding Ua- Angular variables ila 
are uniformly sampled on the unit sphere or circle to 
obtain the points y^ = x^ + Ua fla in the original space. 
The corresponding probabilities Pt{ya) aie computed 
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Fig. 1: Two-dimensional trajectories for 7 = .01 (A) and 
7 = .l (D), after one hit is received at time t = Q (Po=R). 
B: squared distance to the origin, d{t)^, vs. time t for four 
values of 7. C: spacing b between turnings vs. 7. The dotted 
line has slope Integration is done with Nmc = points. 



using Simpson's method to calculate the integral over time 
in (1). 

The algorithm of [7] stored and updated Pt, which made 
the computational time linear in t and in the size of the 
lattice. Our procedure recalculates Pt at each time step, 
which makes the computational time quadratic in t, and 
independent of the a priori infinite size of the space. Errors 
on Pt do not accumulate with time, and the accuracy is 
directly controlled by the number of Monte Carlo sampling 
points, Nmc- In addition, the map Pt is guaranteed to be 
accurately determined where it really matters, i.e. in the 
vicinity of the searcher. 

Initial stage of the search. — We first focus on the 
initial stage of the search process, which strongly depends 
on the a priori distribution, Pq. A possible choice is the 
one-hit prior, Pq = R, which means that the search starts 
only when a first hit is received at time t = Q [7] . In two 
dimensions and before subsequent hits are detected, the 
search trajectories are Archimedean spirals [8,10] for a 
large range of values of 7 (fig. 1 A) . The squared distance of 
the searcher to the origin at time t, d{t)'^, increases linearly 
with t, and is, to a large extent, independent of 7 (fig. IB). 
The spacing b between successive turnings is independent 
of time and increases as ^7 (fig. IC); the velocity |x| ~ ^ 
decreases with the friction 7 as expected. 

The increase of b can be intuitively understood: the 
larger 7, the longer the searcher spends along the trajec- 
tory without receiving hits, and the more likely is the 
source to be located far away. For values of 7(> .08) 
such that b would exceed the length (= 1) over which R 
decreases, the spiral nature of trajectories breaks down 
(fig. ID). We have run simulations with a modified 
potential Vt, where the arguments y — x and y' — x of the 



functions R in (4) were divided by a large factor (= 10). 
The regular spirals then disappeared, and looked like the 
trajectory in fig. ID, even for small values of 7. 

Simulations with other prior distributions show that the 
long-distance behavior of Pq is critical to the existence 
of spiral trajectories, while the behavior of Pq close to 
the origin is irrelevant. Choosing Po{y) ~ exp(— |y|/7/o) we 
obtain spirals as long as yo is not too large. The reason is 
that the spacing b is proportional to yg: spirals explore as 
much space as allowed by the prior. Spirals breakdown for 
the (7-dependent) value of yo such that b exceeds 1. 

Search trajectories in three dimensions display a more 
complex structure than their two-dimensional counter- 
parts. Figure 2 shows the motion of the searcher with 
the one-hit prior, Pq = R. Roughly speaking, the trajec- 
tory is constituted of subsequent shells of increasing radii, 
which are densely covered before a new shell is built. The 
distance to the origin, d{t) , is compatible with t^^^ at large 
times, but grows faster at smaller times (fig. 2). To better 
understand how trajectories develop in three dimensions, 
we have resorted to a small-x expansion of the potential 
Vt. The relevant contributions to the equation of motion 
are, up to cubic order, 

7i(i) =ai(t)x(t) + a2(t) / dt'x(i') 
Jo 



+ [ dt'x(t') 
Jo 



A(i)|x(t')|'+/32(t)x(t')-/dt"x(t") 





(6) 



where the coefficients ai{t) have explicit analytical 
expressions. When x is very small, only the linear terms 
matter. As qi~^^5£*>o for large t, the trajectory 
tends to follow a line radiating from the origin. However, 
the straight line is unstable against local bending since 
C(2 — — ^^^if-^ < 0. The trajectory thus acquires a spiral 
shape confined within the plane spanned by x(0) and 
x(0). The presence of cubic terms (/3i,/32 > 0), which are 
not constrained to lie in this plane, eventually lead to a 
cross-over from the quasi-bidimensional spiral to a fully 
three-dimensional trajectory (fig. 2). 

Replacing coefficients ai , a2 with the smaller values 
cti = ^,ce2 = with ai,a2>0, allows for an exact 

resolution of (6). A spiral is found when ai<02, with 
a radius and an angle growing as, respectively, and 
77 log t, where uj = "i^"^ and rj = \J 4a27 — (7 + ai)^. As 7 
increases so does the angular velocity (cxtj), while the 
growth exponent id diminishes: spirals stop growing if 
7 is too large, a fact reminiscent of fig. ID. Numerical 
resolution of (6) shows that this scenario is qualitatively 
unchanged when the logarithmic factors in Qi,a2 are 
taken into account. 

Pinning after a hit. — Examples of trajectories where 
the searcher receives hits at times ti > are shown in 
fig. 3. Generally speaking, the trajectories are denser when 
more hits are received. After each hit j, the posterior 
distribution Pt is considerably reinforced in x(ii) and its 
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Fig. 2: A three-dimensional trajectory in the absence of hit 
for 7 = .01 (left), with its quasi-two-dimensional initial portion 
(top); the time axis is color coded. Bottom: distance to the 
origin, d{t), compared to the power laws t'^^, then t^^^ . 



neighborhood. The searcher remains in the immediate 
vicinity for a certain time, £„, until the search resumes. 
Informally speaking, the searcher makes sure that the 
source is not in x(ii) before looking elsewhere. Imagine 
that the searcher has not moved at all for a period of time 
after the hit at time t^. The posterior distribution is 
then 



Pt,+tJy)cxP,-(y)i?(y-x(t,))e- 



-t^. fi(y-x(ti)) 



(7) 



Assuming the posterior distribution right before the hit, 
P^-, is smooth in the vicinity of x(£,;), the potential 
Vt.+t^ (x(tj) + u) is a function of the small displacement 
u = |u| and only. There are a local maximum in m = 0, 
since ai > in (7), and a global minimum in Um{tw) > 0. 
For < .4, < .01 is smaller than the error on the 
position deriving from our Monte Carlo integration, while 
for > .5, Um > .1, and the displacement of the searcher 
is easily seen. 

The pinning effect is an important feature of the 
continuous formulation of Infotaxis, and was not observed 
on a lattice. The reason is that, at each time step, a whole 
lattice site probability was set to zero in [7] , which created 
a strong repulsive effect for the searcher and prevented 
pinning. In the continuous space, however, the trajectory 
has zero measure and the repulsion is too weak to override 
the pinning. As the searcher gets closer to the source, 
the average delay t between successive hits gets smaller. 
When r~iu,, the searcher could, in principle, come to 
a complete halt. The distance from the source for which 
this happens, dhait, depends on the dimension: dhait = -1 
for D = 2, dhait = .3 for D = 3. 

Performances. — We now introduce a source and 
observe the trajectory of the searcher reacting to hits 
(fig. 3). We are interested in the probability that the search 
process is successful as a function of the initial distance to 
the source, dg. If the searcher reaches the neighborhood 
of the source of radius dhait defined above the source is 
declared found. If the searcher misses this neighborhood 
and reaches a distance dfaii S> 1 to the source such that 




Fig. 3: Examples of search trajectories with hits in D = 2 (top, 
7 = .02) and _D = 3 (bottom, 7 = .01) dimensions. Trajectories 
on the left find the source, while searches on the right are not 
successful. The initial distance to the source is do = 2. Red disks 
or spheres represent points at distance < dhait to the source. 
Black squares or cubes locate the hits; their size corresponds 
to the amplitude of the erratic motion of the searcher during 
the pinning after a hit. 
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Fig. 4: Probability of success of Infotaxis as a function of the 
initial distance to the source, do (left), and of the friction 7 
(right). Top points correspond to D = 2 dimensions, bottom 
points to D = 3. The numbers of runs is of about 200 for each 
point. All probabilities where obtained with dfaii = 8. 



new hits are highly unlikely, the searcher is declared to be 
lost. Examples of successful and unsuccessful trajectories 
are shown in fig. 3. 

Figure 4 (left) shows that the probability of success in 
dimension D = 2 is compatible with unity for all distances 
do smaller than a few units, in agreement with the findings 
of [7]. On the contrary, in dimension D = 3, the probability 
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Fig. 5: A: histograms of the search times is in D = 2 (7 = .02) 
and D = 3 (7 = .01) dimensions for an initial distance do = 2 
to the source. Full histograms correspond to the average 
trajectories, contour histograms to trajectories with random 
hits. B: average search time tg as a function of 7. C) Total 
CPU time as a function of 7, calculated as (ts/At)^. 
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Fig. 6: A: a search trajectory for the time-dependent 7(t) 
shown below. B: a trajectory with fixed velocity iiq = 2, and 
7(4) = |VxVf l/tJQ. The source is located in {\f2, \fi) (circle). 



of success is definitely smaller than one, and is about .8 
for distances dg ranging from 1 to 3 and for 7 = .01. We 
have also measured the probability of success at fixed dp 
over a range of values of 7 and found little variation (fig. 4, 
right). 

Figure 5A shows that the distribution of the search 
times tg of successful runs has a positive skew and a 
roughly exponential tail not only in dimension D = 2 [7] 
but also in dimension D = S. The exponential nature of 
the tail was checked for various values of the distance do- 
Figure 5B shows that the average time to find the source, 
ts, decreases with 7. The CPU time scales as A{ts/At)^, 
where At = 7 is chosen to obtain numerical stability, i. e. 
small enough local moves at any step £. Hence, the CPU 
time is much larger for small friction (fig. 5C). We find 
^~3ms on one core of a 2.4 GHz Intel Core 2 Quad 
desktop computer, and for Nmc = 10^ Monte Carlo steps. 
The CPU time can be decreased by choosing a smaller 
value for NmCj or by increasing 7 (and At) with time. 

Case of a time-dependent friction j{t). — The 
results reported so far correspond to constant frictions 7. 
We have generated trajectories with various time- 
dependent functions 'y{t), e.g. slowly increasing to reduce 
the scaling of the CPU time with t (fig. 6A), or with the 
fixed-velocity requirement (fig. 6B). From a qualitative 
point of view, we observe no drastic difference with the 
constant 7 case, provided that 7(t) does not exceed the 
maximal value (~ .08) at which 6~1 and spirals break 
down. For instance, the distance between turns in the 
spiral region of the trajectory in fig. 6A can be deduced 
from the value of 6(7(4)) in fig. IC. In the fixed-velocity 
case of fig. 6B, before the first hit is detected, 7(4) shows 
regular oscillations and the trajectory has a spiral-like 
shape with b corresponding to the maximum value of 
7(f)(~.011). Larger fiuctuations are visible during the 
erratic motion after each hit, but 7(4) remains smaller 
than the arrest value 7 ~ .08. 

We have seen in fig. 4 that the probability of success is 
essentially independent of do and 7. This key result can be 
understood as follows. Far away from the source, hits are 



very unlikely and the trajectory develops as a spiral. As 
the area spanned by the trajectory is roughly independent 
of 7 (fig. IB), the time it takes for the searcher to reach 
the Z)-dimensional sphere of radius 1 and centered in the 
source will be roughly independent of 7 and will strongly 
increase with do, while the time spent in the sphere and 
the number of hits received will be essentially independent 
of both 7 and do. We conclude that varying 7 has little 
consequence on the performance of Infotaxis, as long as 
the spiral-like motion is possible. 

Motion in the presence of the "average" signal. 

— Certain characteristics of the search time distribution, 
such as the typical (most probable) value of tg, can be 
assessed from the study of an abstract searcher receiving 
an average signal rather than discrete and sparse hits. To 
define an average posterior density Pi^{y) we remark that 
Pt{y) contains the product oi Nh stochastic R factors over 
the hits in (1). It is natural to define Pf" (y) through the 
average value of logPt(y) over the hits: 

(8) 

up to a multiplicative normalization constant; here, y* 
denotes the location of the source as usual. We define 
the average search trajectory as the solution of eq. (5) 
with the average (•) (2) calculated over the measure (8). 
Average search trajectories are obviously smoother than 
trajectories with random hits, but have common features, 
such as a possible return towards the origin after having 
been close to the source. 

Figure 5A shows that the search time for the average 
motion is in very good agreement with the typical search 
time for trajectories with random hits. This coincidence 
has been observed for all the frictions 7 and distance do 
to the source we have tested. Note that, while the average 
motion is fully deterministic, some noise is introduced to 
break the rotational invariance and select the initial phase 
of the spiral; this noise is responsible for the small width 
of the full histograms shown in fig. 5A. 
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large velocities. While this variational calculation appears 
intractable for general t, a systematic expansion in powers 
of T is possible. To the lowest orders in t the equation of 
motion becomes 
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Fig. 7: Entropy S{t) (bottom, left scale) for one trajectory 
x(i) obtained with random hits (top left, full curve, 3 hits 
are received) and the average trajectory (top right, dashed 
curve). The dotted line shows the average number of hits Nh 
(right scale) received along the average trajectory. The source 
is located in (v^, v^) (circle). 

Figure 7 shows the average search trajectory and a 
random trajectory in D = 2 dimensions, together with the 
entropies of their posterior distributions. In the random 
case, the entropy abruptly decreases right after a hit, 
then increases until the next hit is received (due to the 
exponential time decay in Pf (1)). As for the average 
motion, the entropy shows weak oscillations (due to the 
spiral motion) superimposed to a smooth trend, which 
decreases as the searcher gets close to the source. 

Conclusion. — In this letter we have presented a 
continuous-space version of Infotaxis, and have analyzed 
its behavior in two and three dimensions. When the initial 
distance to the source, dg, is of the order of the decay 
length of R, the probability that Infotaxis finds the target 
is essentially equal to unity in D = 2, and is smaller {~ .8) 
in D = 3 dimensions. The probability of success is roughly 
independent of 7(t) in (5), while the search time and the 
CPU time strongly depend on the friction. The quadratic 
increase of the CPU time and the presence of the pinning 
effect make the computational cost increase as the searcher 
gets closer to the source. Note that the CPU time could 
be made linear in ts (instead of quadratic) if the integral 
in (1) were restricted to the recent past, i.e. if the search 
had a finite memory. 

The pinning of the trajectory by the hits is a direct 
consequence of the greedy nature of Infotaxis: the searcher 
moves to maximize the immediate gain in information, 
irrespectively of what could be gained on a longer-time 
horizon. To overstep this greedy strategy consider the 
expected gain in information, /[x(i'); t < t' < t + t], when 
the searcher plans to move along a portion of trajectory 
x(t') during the current time t and the time t + r. The 
best portion of trajectory is determined through the maxi- 
mization of /, minus a quadratic term oc 7 penalizing 



T'(VxVxl/t(x)) i + 7i= -Vxl/t(x) 



(9) 



up to 0{t) corrections to the friction 7 and to the force 
on the right-hand side. The introduction of a finite-time 
horizon, r, gives birth to an inertial term, with an effective 
mass tensor proportional to the curvature matrix of the 
potential (4). This inertial motion could help reduce the 
pinning following a hit, and avoid the slowing-down of 
the searcher close to the source. The analysis of (9) and of 
the search trajectories is left for a future work. 

Last of all, the shapes of the trajectories observed in 
two and three dimensions result from a trade-off between 
the self-repulsion of the trajectory (the searcher does not 
come again close to a point where the source was not 
detected) and the confinement due to the hits or to the 
prior (the source is likely to be close to a detection). 
This trade-off is present in physical systems such as 
polyelectrolytes (charged polymers) confined in a volume 
or on a surface [11]. It is however unclear how far the 
analogy between the out-of-equilibrium process generated 
by Infotaxis and equilibrium polyelectrolytes could be 
pursued [12]. 
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Appendix A 

Inference of couplings for a set of 
leaky integrate and fire neurons 



A.l Introduction 



Recent advances in experimental techniques and in the miniaturization of components have 
permitted the recording of the activity of several neurons at the same time thorugh the use 
of multi-electrode recordings [Taketani 06j . 

The observation of substantial correlations in the firing activities of neurons has raised fun- 
damental issues on their functional role [Averbeck 06j . However the problem of inferring the 
structure of the network and the interaction between different neurons has only recently been 
attacked (Fig. A.l). The problem is not easy to tackle, because data sets are already quite 
big and can contain millions of spiking events from up to a hundred neurons. 
A classical approach to infer functional neural connectivity is through the analysis of pair- 
wise cross-correlations. The approach is versatile and fast, but cannot disentangle direct 
correlations from common or correlated inputs. Alternative approaches assume a particular 
dynamical model for the spike generation, such as the generalized linear model, which rep- 
resents the generation of spikes as a Poisson process with a time-dependent rate, and the 
Integrate-and-Fire (IF) model, where spikes are emitted according to the dynamics of the 
membrane potential [Jolivet 04] . 

While the problem of estimating the model parameters (external current, variance of the 
noise, capacitance and conductance of the membrane, ...) of a single stochastic IF neuron 
from the observation of a spike train has received a lot of attention [Paninski 04| [Lansky 08] , 
few studies have focused on the inference of interactions in an assembly of IF neurons. Cocco 
and Monasson have recently proposed a Bayesian algorithm to infer the intereactions and the 
external currents of a set of leaky integrate and fire neurons [Cocco 09] IMonasson 11] . They 
applied their approach to data coming from real experiments on salamander retinas and val- 
idated their results with artificial data and cross-checking with another algorithm based on 
the Ising model. 

An interesting problem they've come across is the disentanglement of the correlations already 
present in the stimulus and the correlations that come from the topology of the network itself. 
This has been discussed by comparing datasets coming from the same retina with different 
stimuli. 
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Figure A.l: Left: times ti^k of spikes emitted by a set of neurons (raster plot). Right: network 
of LIF neurons with couphngs Jij and external currents !{. Given the set of spikes we want 
to infer the values of the couplings and currents. 



A. 2 Integrate and fire neurons 

Each neuron is represented by the Leaky Integrate-and-Fire (LIF) model (see jJolivet 04] and 
references therein). The membrane potential obeys the differential equation, 

C^{t) = -gV,{t)+ J2 J^:i Y.^{t-h,k) + Ii + m{t) , (A.l) 

where C, g are, respectively, the membrane capacitance and conductance. Jij is the strength 
of the connection from neuron j onto neuron i and tj^^ the time at which cell j fires its k^^ 
spike; we assume that synaptic inputs are instantaneously integrated i. e. the synaptic integra- 
tion time is much smaller than the membrane leakage time, C/g, and the typical inter-spike 
interval. It is a constant external current flowing into cell i, and r/i(i) is a fluctuating current, 
modeled as a Gaussian white noise process with variance a^. Neuron i remains silent as long 
as Vi remains below the threshold potential Vth (set to unity in the following) . If the threshold 
is reached at some time then a spike is emitted, and the potential is reset to its rest value 
(which can be set to zero without loss of generality), and the dynamics resumes. 



The above model ( A.l ) implicitly defines the likelihood P of the spiking times {tj^k} given the 
currents /j and synaptic couplings Jij. If we are given the spiking times {tj,fc} we will infer 
the couplings and currents by maximizing P. In principle P can be calculated through the 
resolution of Fokker-Planck equations (one for each inter-spike interval) for a one-dimensional 
Orstein-Uhlenbeck process with moving boundaries. However this approach, or related numer- 
ical approximations [Paninski 04j . are inadequateis too slow to treat data sets with hundreds 
of thousands of spikes. 

In Cocco's and Monasson's approach P is approximated from the contribution coming from 
the most probable trajectory for the potential for each cell i, referred to as V*{t). This semi- 
classical approximation is exact when the amplitude a of the noise is small. The determination 
of V* (t) was done numerically by Paninski for one cell in [Paninski 06j . What they proposed 
is a fast algorithm to determine V*{t) analytically in a time growing linearly with the num- 
ber of spikes and quadratically with the number of neurons, which allows them to process 
recordings with tens of neurons easily. The algorithm is based on a detailed and analytical 
resolution of the coupled equations for the optimal potential V* (t) and the associated optimal 



noise i]*{t) through (A.l), since this work has not been performed during this thesis and a 
full explanation would take many pages we will not discuss the details of the algorithm. The 
interest reader can find an explanation of the approach in a recent publication [Monasson llj . 
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APPENDIX A. INFERENCE OF COUPLINGS FOR A SET OF LEAKY INTEGRATE AND 

FIRE NEURONS 



Once the optimal paths for the potential and noise have been determined, one cancalculate 
the log-hkehhood of the corresponding couphngs and currents through the integral of the 
squared optimal noise. This log-likelihood is a concave function of the currents and couplings 
and can be easily maximized using convex optimization procedures. Measure of the curvature 
of the log-likelihood allows us to estimate the error bars on the inferred parameters. 

A. 3 Limitations of the original implementation 

Even though the conception of the algorithm and the subsequent testing performed by Cocco 
and Monasson have been very thorough, distribution of software through its source code can 
scare all but the most tech savy scholars in the field. 

Moreover the algorithm was originally written in non-standard Fortran 77 that was only 
compatible with g77 and not with gfortran. As of version 3.4 of GCC (released in May 2006) 
development of g77 has stopped and users are encouraged to use gfortran. 
Because of this most modern Linux distributions do not come with g77 preinstalled and users 
who need g77 have to install it separately sometimes compiling the compiler itself. 
Another important hurdle to overcome before the public release of this software was the fact 
that it originally contained parts of code that were covered by copyright [Press 86] and could 
not be reused freely. 

An important part of the implementation of the original program relied on the implementation 
of the classical Newton method for multidimensional minimzation. Knowing that the function 
to minimize is convex in the parameters guarantees the convergence of the algorithm, howver 
it is not at all clear that among all minimization techniques Newton's would be the faster in 
all cases. 

In fact Newton's method relies on the exact computation of the Hessian matrix which is 
computatinally taxing and sometimes possible only in an approximate form, because of this 
we have chosen to reimplement the software in a way that permitted a modular change of 
minimzation techniques. 

Further limitations included the lack of inferenence of certain parameters of the model such 
as the leaking constant which was fixed at the beginning of the inference and considered 
equal for all neurons. 

Because of this we have decided to translate the problem in standard C which is a far more 
widespread programming language, arguably the most common. Compilers in C are available 
on virtually every architecture, and there are a variety of free open-source numerical libraries 
that can be effortlessly used for the implementation of minimization techniques and special 
functions. 

Morevore standard C code is very easily integrated in more higher level computational software 
such as Matlab which is very widely used in the computational biology community. 
We believe that these contributions, even though it is not an algorithmic effort, but of a more 
mundane nature can be of great use in the diffusion of this algorithm and its use. 

A. 4 Description of the software package 

The software package is written in standard C and the source code will soon be available for 
download. C was the natural choice for a program that can be used either as a stand-alone 
executable or called from widely used computational software such as Matlab [MathWorks llj 
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and Mathematica |Wolfram Research llj . 
The software is composed of several functions: 

• A function that reads data in the form of spike trains, i.e. a two column file where the 
first column is the time at which the spike was emitted and the second is an integer 
value that identifies the neuron responsible for that spike. Data are then stored in data 
structures of variable size, to be conveniently accessed by other functions. 

• A function that goes through the data and identifies all spikes incoming to a cell in the 
time interval between two successive spikes of that cell. This function also performs a 
check on collision, that is, multiple spikes emitted by the same cell at the same time. 

• A function that computes the log-likelihood and its first and second derivatives with 
respect to the interactions and the current. Note that the time-consuming calculation 
of second derivatives can be switched off if the minimization algorithm employed does 
not require them. 

These functions have appropriate wrappers that allow for use with the minimization rou- 
tines available from the GNU Scientific Library (GSL) |GSL llj and Matlab. The choice of 
minimization technique and related parameter is left to the user, but we have observed the 
gnewton technique of the GSL to be the fastest in most cases. 



The user chooses the values of the parameters of the LIF model (A.l); when the leaking 
constant g is set to zero, that is when the integrator is non- leaky, a specific and much faster 
program is used. Otherwise the most likely value for g can be inferred from the data for 
each neuron. Also available to the user specifications is a choice of priors over the coupling 
values, based on the Li and L2 norms, which can be used to ensure convergence to realistic 
values and/or to eliminate couplings which are very close to zero. The program can be easily 
modified to add further specific priors. The user can further improve upon the instantaneous 



synaptic integration assumption in the model (A.l). To do so an option allows the user to 
introduce a synaptic reweighing function, replacing Jij with Jij x K[ti^y — tj^^), where tj^k is 
the time of the spike fired by cell j and entering cell i, and ti^k' is the next spiking time of cell 
i; K{x) = for x = and K{x) ~ 1 for x > Ts, the synaptic integration time. 
The output data can be printed to a file or to a Matlab array. The file is composed of three 
columns, the first two denote the indices of the coupling matrix Jij and the third the value 
of the coupling constant. The diagonal elements of the matrix Ja are the currents /j. 



142 



A.4. DESCRIPTION OF THE SOFTWARE PACKAGE 



Bibliography 



[Adler 66] 
[Angelescu 08] 



[Ashkin 70] 
[Ashkin 86] 

[Atkinson 69] 



[Averbeck 06] 

[Baldazzi 06] 

[Baldazzi 07] 

[Balkovsky 02] 

[Barany 91] 
[Barbieri 07] 



J. Adler. Chemotaxis in Bacteria. Science, vol. 153, pages 708-716, 
1966. 

D. G. Angelescu, P. Linse, T. T. Nguyen & R. F. Bruinsma. Structural 
transitions of encapsidated poly electrolytes. The European Physical 
Journal E - Soft Matter and Biological Physics, vol. 25, no. 3, pages 
323-334, 2008. 

A. Ashkin. Acceleration and trapping of particles by radiation pres- 
sure. Physical Review Letters, vol. 24, no. 4, pages 156-159, 1970. 

A. Ashkin, J. M. Dziedzic, J. E. Bjorkholm 8z S. Chu. Observation of 
a single-beam gradient force optical trap for dielectric particles. Optics 
letters, vol. 11, no. 5, pages 288-290, 1986. 

M. R. Atkinson, M. P. Dcutschcr, A. Kornbcrg, A. F. Russell k J. G. 
Moffatt. Enzymic synthesis of deoxyribonucleic acid. XXXIV. Termi- 
nation of chain growth by a 2', 3 '-dideoxyribonucleotide. Biochemistry, 
vol. 8, no. 12, pages 4897-4904, 1969. 

B. B. Averbeck, P. E. Latham & A. Pouget. Neural correlations, 
population coding and computation. Nature Reviews Neuroscience, 
vol. 7, no. 5, pages 358-366, 2006. 

V. Baldazzi, S. Cocco, E. Marinari Sz R. Monasson. Inference of DNA 
Sequences from Mechanical Unzipping: An Ideal-Case Study. Physical 
Review Letters, vol. 96, no. 12, page 128102, 2006. 

V. Baldazzi, S. Braddc, S. Cocco, E. Marinari & R. Monasson. In- 
ferring DNA sequences from, m,echanical unzipping data: the large- 
bandwidth case. Physical Review E, vol. 75, no. 1, page 011904, 2007. 

E. Balkovsky &: B. I. Shraiman. Olfactory search at high Reynolds 
number. Proceedings of the National Academy of Sciences, vol. 99, 
no. 20, page 12589, 2002. 

F. Barany. The ligase chain reaction in a PGR world. Genome Re- 
search, vol. 1, no. 1, page 5, 1991. 

C. Barbieri. Infotassi: un algoritmo di ricerca senza gradienti. Mas- 
ter's thesis, Universita di Roma "La Sapienza" , December 2007. 



143 



C. Barbieri, S. Cocco, R. Monasson &: F. Zamponi. Dynamical model- 
ing of molecular constructions and setups for DNA unzipping. Physi- 
cal Biology, vol. 6, page 025003, 2009. 

C. Barbieri, S. Cocco & R. Monasson. On the trajectories and per- 
formance of Infotaxis, an information-based greedy search algorithm. 
Europhysics Letters, vol. 94, no. 2, page 20005, 2011. 

T. Bayes. An Essay towards Solving a Problem in the Doctrine of 
Chances. Philosophical Transactions of the Royal Society, vol. 53, 
pages 370-418, 1763. 

T. Bayes. An Essay towards Solving a Problem in the Doctrine 
of Chances. Reprint in modern notation of \Bayes 6^ . Biometrika, 
vol. 45, pages 293-315, 1958. 

H. C. Berg. Chemotaxis in Bacteria. Annual Review of Biophysics 
and Bioengineering, vol. 4, pages 119-136, 1975. 

H. C. Berg. A physicist looks at bacterial chemotaxis. In Cold Spring 
Harbor Symposium on Quantitative Biology, volume 53, pages 1-9, 
1988. 

C. Bouchiat, M. D. Wang, J. F. Allemand, T. Strick, S. M. Block 
& V. Croquette. Estimating the persistence length of a worm-like 
chain molecule from force- extension measurements. Biophysical jour- 
nal, vol. 76, no. 1, pages 409-413, 1999. 

K. J. Breslauer, R. Frank, H. Blocker & L. A. Marky. Predicting DNA 
duplex stability from the base sequence. Proceedings of the National 
Academy of Sciences, vol. 83, no. 11, page 3746, 1986. 

J. J. Cerda, T. Sintes & A. Chakrabarti. Excluded volume effects 
on polymer chains confined to spherical surfaces. Macromolecules, 
vol. 38, no. 4, pages 1469-1477, 2005. 

J. Clarke, H. C. Wu, L. Jayasinghe, A. Patel, S. Reid k. H. Bayley. 

Continuous base identification for single-molecule nanopore DNA se- 
quencing. Nature nanotechnology, vol. 4, no. 4, pages 265-270, 2009. 

S. Cocco, R. Monasson &: J. F. Marko. Force and kinetic barriers 
to unzipping of the DNA double helix. Proceedings of the National 
Academy of Sciences, vol. 98, no. 15, page 8608, 2001. 

S. Cocco, J. F. Marko &: R. Monasson. Slow nucleic acid unzipping ki- 
netics from sequence-defined barriers. The European Physical Journal 
E - Soft Matter and Biological Physics, vol. 10, no. 2, pages 153-161, 
2003. 

S. Cocco, S. Leibler & R. Monasson. Neuronal couplings between 
retinal ganglion cells inferred by efficient inverse statistical physics 



BIBLIOGRAPHY 



BIBLIOGRAPHY 



[Crothers 64] 



[Curtis 02] 



[Danilowicz 03] 



[Doi 86] 

[Elowitz 02] 

[Eyring 35] 
[Flory 53] 

[Friedman 06] 



[Gal 80] 
[Glessner 10] 



[Gosse 02] 



[GSL 11] 



methods. Proceedings of the National Academy of Sciences, vol. 106, 
no. 33, page 14058, 2009. 

D. M. Crothers & B. H. Zimm. Theory of the melting transition 
of synthetic polynucleotides: Evaluation of the stacking free energy*. 
Journal of Molecular Biology, vol. 9, no. 1, pages 1-9, 1964. 

J. E. Curtis, B. A. Koss & D. G. Grier. Dynamic holographic optical 
tweezers. Optics Communications, vol. 207, no. 1-6, pages 169-175, 
2002. 

C. Danilowicz, V. W. Coljee, C. Bouzigues, D. K. Lubensky, D. R. Nel- 
son & M. Prentiss. DNA unzipped under a constant force exhibits mul- 
tiple metastable intermediates. Proceedings of the National Academy 
of Sciences, vol. 100, no. 4, page 1694, 2003. 

M. Doi & S. F. Edwards. The Theory of polymer dynamics. 
Numeero 73 in International Series of Monographs on Physics. Claren- 
don Press, 1986. 

M.B. Elowitz, A.J. Levine, E.D. Siggia &: P.S. Swain. Stochastic gene 
expression in a single cell. Science, vol. 297, no. 5584, page 1183, 
2002. 

H. Eyring. The activated complex in chemical reactions. The Journal 
of Chemical Physics, vol. 3, page 107, 1935. 

P. J. Flory. Principles of polymer chemistry. George Fisher Baker 
non-resident lectureship in chemistry at Cornell University. Cornell 
University Press, 1953. 

J. M. Friedman, A. Baross, A. D. Delaney, A. Ally, L. Arbour, 
J. Asano, D. K. Bailey, S. Barber, P. Birch, M. Brown-Johnei al. 

Oligonucleotide microarray analysis of genomic imbalance in children 
with mental retardation. The American Journal of Human Genetics, 
vol. 79, no. 3, pages 500-513, 2006. 

S. Gal. Search games. Academic Press, 1980. 

J. T. Glessner, M. P. Reilly, C. E. Kim, N. Takahashi, A. Albano, 
C. Hou, J. P. Bradfield, H. Zhang, P. Sleiman, J. H. Flory ei al. Strong 
synaptic transmission impact by copy number variations in schizophre- 
nia. Proceedings of the National Academy of Sciences, vol. 107, no. 23, 
page 10584, 2010. 

C. Gosse k. V. Croquette. Magnetic tweezers: micromanipulation and 
force measurement at the molecular level. Biophysical Journal, vol. 82, 
no. 6, pages 3314-3329, 2002. 



GSL. 



GSL documentation - Multidimensional Root-finding. 



http : //www . gnu . org/ software/ gsl/manual/html_node/ 



Multidimensional-Root_002dFinding.htiiil 2011. 



BIBLIOGRAPHY 



145 



J. p. Huelsenbeck, F. Ronquist, R. Nielsen &; J.P. BoUback. Bayesian 
inference of phylogeny and its impact on evolutionary biology. Science, 
vol. 294, no. 5550, page 2310, 2001. 

J. M. Huguet, N. Forns & F. Ritort. Statistical properties of metastahle 
intermediates in DNA unzipping. Physical review letters, vol. 103, 
no. 24, page 248106, 2009. 

J. M. Huguet, C. V. Bizarro, N. Forns, S. B. Smith, C. Bustamantc &: 
F. Ritort. Single-molecule derivation of salt dependent base-pair free 
energies in DNA. Proceedings of the National Academy of Sciences, 
vol. 107, no. 35, page 15431, 2010. 

A. J. lafrate, L. Feuk, M. N. Rivera, M. L. Listewnik, P. K. Donahoe, 
Y. Qi, S. W. Scherer &; C. Lee. Detection of large-scale variation in 
the human genome. Nature genetics, vol. 36, no. 9, pages 949-951, 
2004. 

H. M. James &: E. Guth. Theory of the elastic properties of rubber. 
The Journal of Chemical Physics, vol. 11, page 455, 1943. 

R. Jolivet, T. J. Lewis & W. Gerstner. Generalized integrate- and- 
fire models of neuronal activity approximate spike trains of a de- 
tailed model to a high degree of accuracy. Journal of Neurophysiology, 
vol. 92, no. 2, page 959, 2004. 

E. F. Keller & L. A. Segel. Initiation of Slime Mold Aggregation 
Viewed as an Instability. Journal of Theoretical Biology, vol. 26, no. 3, 

pages 399-415, 1970. 

M. Kcrkcr. The scattering of light and other electromagnetic radia- 
tion. Academic Press, 1969. 

A. Koike, N. Nishida, D. Yamashita &: K. Tokunaga. Comparative 
analysis of copy number variation detection methods and database con- 
struction. BMC genetics, vol. 12, no. 1, page 29, 2011. 

H. A. Kramers. Brownian motion in a field of force and the diffusion 
model of chemical reactions. Physica, vol. 7, no. 4, pages 284-304, 
1940. 

W. Kuhn & F. Griin. Beziehungen zwischen elastischen Konstanten 
und Dehnungsdoppelbrechung hochelastischer Stoffe. Colloid &z, Poly- 
mer Science, vol. 101, no. 3, pages 248-271, 1942. 

P. Lansky & S. Ditlevsen. A review of the methods for signal estima- 
tion in stochastic diffusion leaky integrate- and- fire neuronal models. 
Biological cybernetics, vol. 99, no. 4, pages 253-262, 2008. 

M. J. Levene, J. Korlach, S. W. Turner, M. Foquet, H. G. Craighead 
&: W. W. Webb. Zero-mode waveguides for single-molecule analysis 
at high concentrations. Science, vol. 299, no. 5607, page 682, 2003. 



BIBLIOGRAPHY 



BIBLIOGRAPHY 



[MacKay 05] 
[Mangeol 08] 

[Manosas 05] 

[Marko 94] 
[Marko 95] 
[Masson 09] 

[MathWorks 11] 
[Maxam 77] 

[McKernan 09] 



[Monasson 11] 



D. J. C. MacKay. Information theory, inference, and learning algo- 
rithms. Cambridge University Press, 4*^ edition, 2005. 

P. Mangeol & U. Bockelmann. Interference and crosstalk in double 
optical tweezers using a single laser source. Review of Scientific In- 
struments, vol. 79, page 083103, 2008. 

M. Manosas & F. Ritort. Thermodynamic and kinetic aspects of RNA 
pulling experiments. Biophysical journal, vol. 88, no. 5, pages 3224— 
3242, 2005. 

J. F. Marko & E. D. Siggia. Bending and twisting elasticity of DNA. 
Macromolecules, vol. 27, no. 4, pages 981-988, 1994. 

J. F. Marko & E. D. Siggia. Stretching dna. Macromolecules, vol. 28, 
no. 26, pages 8759-8770, 1995. 

J. B. Masson, M. B. Bechet & M. Vergassola. Chasing information to 
search in random environments. Journal of Physics A: Mathematical 
and Theoretical, vol. 42, page 434009, 2009. 



[Moroz 97] 



MathWorks. Matlab support - MEX-files guide. http://www. 
imathworks . com/ support/tech-notes/ 1600/ 1605 . html , 2011. 

A. M. Maxam & W. Gilbert. A new method for sequencing DNA. 
Proceedings of the National Academy of Sciences, vol. 74, no. 2, page 
560, 1977. 

K. J. McKernan, H. E. Peckham, G. L. Costa, S. F. McLaughlin, 
Y. Fu, E. F. Tsung, C. R. Clouser, C. Duncan, J. K. Ichikawa, C. C. 
Lee, Z. Zhang, S. S. Ranade, E. T. Dimalanta, F. C. Hyland, T. D. 
Sokolsky, L. Zhang, A. Sheridan, H. Fu, C. L. Hendrickson, B. Li, 
L. Kotler, J. R. Stuart, J. A. Malek, Jo. M. Manning, A. A. An- 
tipova, D. S. Perez, M. P. Moore, K. C. Hayashibara, M. R. Lyons, 
R. E. Beaudoin, B. E. Coleman, M. W. Laptewicz, A. E. Sannican- 
dro, M. D. Rhodes, R. K. Gottimukkala, S. Yang, V. Bafna, A. Bashir, 
A. MacBride, C. Alkan, J. M. Kidd, E. E. Eichler, M. G. Reese, F. M. 
De La Vega & A. P. Blanchard. Sequence and structural variation in 
a human genome uncovered by short-read, massively parallel ligation 
sequencing using two-base encoding. Genome Research, vol. 19, no. 9, 
pages 1527-1541, 2009. 

R. Monasson & S. Cocco. Fast inference of interactions in assemblies 
of stochastic integrate- and- fire neurons from spike recordings. Journal 
of Computational Neuroscience, pages 1-29, 2011. 10.1007/sl0827- 
010-0306-8. 

J. D. Moroz & P. Nelson. Torsional directed walks, entropic elasticity, 
and DNA twist stiffness. Proceedings of the National Academy of 
Sciences, vol. 94, no. 26, page 14418, 1997. 



BIBLIOGRAPHY 



147 



A. Mossa, J. M. Huguet & F. Ritort. Investigating the thermodynamics 

of sm,all hiosystems with optical tweezers. Physica E: Low-dimensional 
Systems and Nanostructures, vol. 42, no. 3, pages 666-671, 2010. 

K. B. Mullis, F. A. Faloona, S. J. Scharf, R. K. Saiki, G. T. Horn & 

H. Erlich. Specific enzymatic amplification of DNA in vitro: the poly- 
merase chain reaction. In Cold Spring Harbor Symposia on Quanti- 
tative Biology, volume 51, page 263. Cold Spring Harbor Laboratory 
Press, 1986. 

K. B. Mullis, F. Ferre &: R. A. Gibbs. The polymerase chain reaction. 
Birkhauser Boston Inc., 1994. 

T. Odijk. Stiff chains and filaments under tension. Macromolecules, 
vol. 28, no. 20, pages 7016-7018, 1995. 

L. Paninski, J. W. Pillow & E. P. Simoncelli. Maximum likelihood 
estimation of a stochastic integrate- and- fire neural encoding model. 
Neural Computation, vol. 16, no. 12, pages 2533-2561, 2004. 

L. Paninski. The most likely voltage path and large deviations approx- 
imations for integrate- and- fire neurons. Journal of Computational 
Neuroscience, vol. 21, no. 1, pages 71-87, 2006. 

W. H. Press, S. A. Teukolsky, W. T. Wetterling & B. P. Flannery. 
Numerical recipes in fortran 77. the art of scientific computing. Cam- 
bridge University Press, 2 edition, 1986. 

K. B. Raper. Dictyostelium discoideum, a new species of slime mold 
from decaying forest leaves. Journal of Agricultural Research, vol. 50, 
pages 135-147, 1935. 

K. B. Rapcr. Pseudoplasmodium formation and, organization in Dic- 
tyostelium discoidcum. Journal of the Elisha Mitchell Scientific Soci- 
ety, vol. 56, pages 241-282, 1940. 

S. Redner. A guide to first-passage processes. Cambridge University 
Press, 2001. 

M. Ronaghi, S. Karamohamed, B. Pettersson, M. Uhlen & P. Nyren. 
Real-time DNA sequencing using detection of pyrophosphate release. 
Analytical biochemistry, vol. 242, no. 1, pages 84-89, 1996. 

M. Ronaghi, M. Uhlen & P. Nyrcn. A sequencing method based on 
real-time pyrophosphate. Science, vol. 281, no. 5375, pages 363-365, 
1998. 

P. E. Rouse Jr. A Theory of the Linear Viscoelastic Properties of 
Dilute Solutions of Coiling Polymers. Journal of Chemical Physics, 
vol. 21, pages 1272-1280, 1953. 



BIBLIOGRAPHY 



BIBLIOGRAPHY 



[Sanger 75] 
[Sanger 77] 
[Sebat 04] 

[Sebat 07] 

[Segall 86] 

[Shaffer 53] 
[Slilien 10] 
[Slosar 06] 
[Smitli 92] 

[Smitli 96] 

[Smolucfiowski 17] 

[Staden 79] 
[Stokes 51] 

[Sundaram 10] 



F. Sanger &; A. R. Coulson. A rapid method for determining sequences 
in DNA by primed synthesis with DNA polymerase. Journal of Molec- 
ular Biology, vol. 94, no. 3, pages 441-446, 1975. 

F. Sanger, S. Nicklen & A. R. Coulson. DNA sequencing with chain- 
terminating inhibitors. Proceedings of the National Academy of Sci- 
ences, vol. 74, no. 12, page 5463, 1977. 

J. Sebat, B. Lakshmi, J. Troge, J. Alexander, J. Young, P. Lundin, 
S. Maner, H. Massa, M. Walker, M. Chiei al. Large-scale copy number 
polymorphism in the human genome. Science, vol. 305, no. 5683, page 
525, 2004. 

J. Sebat, B. Lakshmi, D. Malhotra, J. Troge, C. Lese-Martin, 

T. Walsh, B. Yamrom, S. Yoon, A. Krasnitz, J. Kendallet al. Strong 
association of de novo copy number mutations with autism. Science, 
vol. 316, no. 5823, page 445, 2007. 

J. E. Segall, S. E. Block &; H. C. Berg. Temporal Comparisons in Bac- 
terial Chemotaxis. Proceedings of the National Academy of Sciences, 
vol. 83, no. 23, pages 8987-8991, 1986. 

B. M. Shaffer. Aggregation in cellular slime molds: in vitro isolation 
of acrasin. Nature, vol. 171, pages 975-977, 1953. 

A. Shlien & D. Malkin. Copy number variations and cancer suscepti- 
bility. Current opinion in oncology, vol. 22, no. 1, page 55, 2010. 

A. Slosar & R. Podgornik. On the connected- charges Thomson prob- 
lem. Europhysics Letters, vol. 75, page 631, 2006. 

S. B. Smith, L. Finzi & C. Bustamante. Direct mechanical measure- 
ments of the elasticity of single DNA molecules by using magnetic 
beads. Science, vol. 258, no. 5085, pages 1122-1126, 1992. 

S. B. Smith, Y. Cui & C. Bustamante. Overstretching B-DNA: The 
Elastic Response of Individual Double- Stranded and Single- Stranded 
DNA Molecules. Science, vol. 271, no. 5250, page 795, 1996. 

M. V. Smoluchowski. Versuch einer mathematischen Theorie des 
Koagulationeskinetik kolloider Losungen. Zeitschrift fiir physicalische 
Chimie, vol. 92, pages 129-168, 1917. 

R. Staden. A strategy of DNA sequencing employing computer pro- 
grams. Nucleic Acids Research, vol. 6, no. 7, page 2601, 1979. 

G. G. Stokes. On the Effects of the Internal Friction of Fluids on the 
Motion of Pendulums. Transactions of the Cambridge Philosophical 
Society, vol. IX, page 8, 1851. 

S. K. Sundaram, A. M. Huq, B. J. Wilson & H. T. Chugani. Tourette 
syndrome is associated with recurrent exonic copy number variants. 
Neurology, vol. 74, no. 20, page 1583, 2010. 



BIBLIOGRAPHY 



149 



BIBLIOGRAPHY 



[Taketani 06] 
[Tinoco 71] 

[Tinoco 73] 

[Uhlenbeck 30] 

[Vergassola 07a] 
[Vergassola 07b] 

[Vickers 94] 



[Viterbi 67] 

[Wallmark 57] 
[Wiedmann 94] 



M. Taketani & M. Baudry. Advances in network electrophysiology: 
using multi-electrode arrays. Springer, 2006. 

I. Tinoco, O. C. Uhlenbeck & M. D. Levine. Estimation of secondary 
structure in ribonucleic acids. Nature, vol. 230, no. 5293, pages 362- 
367, 1971. 

I. Tinoco, P. N. Borer, B. Dengler, M. D. Levine, O. C. Uhlenbeck, 
D. M. Crothers & J. Gralla. Improved estimation of secondary struc- 
ture in ribonucleic acids. Nature, vol. 246, no. 150, pages 40-41, 1973. 

G. E. Uhlenbeck & L. S. Ornstein. On the theory of the Brownian 
motion. Physical Review, vol. 36, no. 5, page 823, 1930. 

M. Vergassola. Private communication. 2007. 

M. Vergassola, E. Villermaux &; B. I. Shraiman. 'Infotaxis' as a Strat- 
egy for Searching without Gradients. Nature, vol. 445, pages 406-409, 
2007. 

N. J. Vickers & T. C. Baker. Reiterative responses to single strands 
of odor promote sustained upwind flight and odor source location by 
moths. Proceedings of the National Academy of Sciences, vol. 91, 
no. 13, page 5756, 1994. 

A. Viterbi. Error bounds for convolutional codes and an asymptoti- 
cally optimum decoding algorithm. IEEE Transactions on Information 
Theory, vol. 13, no. 2, pages 260-269, 1967. 

J. T. Wallmark. A new semiconductor photocell using lateral photo- 
effect. Proceedings of the IRE, vol. 45, no. 4, pages 474-483, 1957. 

M. Wiedmann, W. J. Wilson, J. Czajka, J. Luo, F. Barany & 
C. A. Batt. Ligase chain reaction (LCR) -overview and applications. 
Genome Research, vol. 3, no. 4, page S51, 1994. 



[Wolfram Research 11] Wolfram Research. Mathematica support C/C-h-h Language 

Interface. http: //reference .wolfram. coin/niatheniatica/guide/| 
[CLanguagelnterf ace . htmll 2011. 



[Woodside 06a] 



[Woodside 06b] 



M. T. Woodside, P. C. Anthony, W. M. Behnke-Parks, K. Larizadeh, 
D. Herschlag & S. M. Block. Direct measurement of the full, sequence- 
dependent folding landscape of a nucleic acid. Science, vol. 314, 
no. 5801, page 1001, 2006. 

M. T. Woodside, W. M. Behnke-Parks, K. Larizadeh, K. Travers, 
D. Herschlag & S. M. Block. Nanomechanical measurements of the 
sequence- dependent folding landscapes of single nucleic acid hairpins. 
Proceedings of the National Academy of Sciences, vol. 103, no. 16, 
pages 6190-6195, 2006. 



150 



BIBLIOGRAPHY 



BIBLIOGRAPHY 



[Wu 89] D. Y. Wu & R. B. Wallace. The ligation amplification reaction (LAR)- 

amplification of specific DNA sequences using sequential rounds of 
template- dependent ligation. Genomics, vol. 4, no. 4, pages 560-569, 
1989. 

[Zou 05] M. Zou &: S.D. Conzen. A new dynamic Bayesian network (DBN) 

approach for identifying gene regulatory networks from time course 
microarray data. Bioinformatics, vol. 21, no. 1, page 71, 2005. 



BIBLIOGRAPHY 



151 



